{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "QJb3sp4ZCF_O"
   },
   "source": [
    "# Data Preparation\n",
    "\n",
    "Notebook supporting the [**Do we know our data, as good as we know our tools** talk](https://devoxxuk19.confinabox.com/talk/VEM-8021/Do_we_know_our_data_as_good_as_we_know_our_tools%3F) at [Devoxx UK 2019](http://twitter.com/@DevoxxUK).\n",
    "\n",
    "The contents of this notebook are inspired by many sources.\n",
    "\n",
    "\n",
    "### High-level steps covered:\n",
    "\n",
    "- Data Cleaning\n",
    "  - Deal with errors\n",
    "  - Deal with duplicates\n",
    "  - Deal with outliers [DEMO - WALKTHRU]\n",
    "  - Deal with missing data [DEMO - WALKTHRU]\n",
    "- Deal with too much data\n",
    "\n",
    "\n",
    "## Resources\n",
    "\n",
    "### Data cleaning\n",
    "- [Data cleaning](https://elitedatascience.com/data-cleaning)\n",
    "- [Spend Less Time Cleaning Data with Machine Learning](https://www.dataversity.net/spend-less-time-cleaning-data-with-machine-learning/#)\n",
    "- [Helpful Python Code Snippets for Data Exploration in Pandas - lots of python snippets to select / clean / prepare](https://medium.com/@msalmon00/helpful-python-code-snippets-for-data-exploration-in-pandas-b7c5aed5ecb9)\n",
    "- [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)\n",
    "- [Journal of Statistical Software - TidyData](https://www.jstatsoft.org/article/view/v059i10/)\n",
    "\n",
    "### Data preprocessing / Data Wrangling\n",
    "- [Data Preprocessing vs. Data Wrangling in Machine Learning Projects](https://www.infoq.com/articles/ml-data-processing)\n",
    "- [Improve Model Accuracy with Data Pre-Processing](https://machinelearningmastery.com/improve-model-accuracy-with-data-pre-processing/)\n",
    "- **[Useful cheatsheets](https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/README-details.md#cheatsheets)**\n",
    "\n",
    "\n",
    "Please refer to the [Slides](http://bit.ly/do-we-know-our-data) for the steps that follow."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ask every question you should ask about the domain and any related domains or sub-domains.\n",
    "\n",
    "It is a good idea to know the **why** behind each action: why are we doing what we are doing with the data? See the [five whys](https://en.wikipedia.org/wiki/5_Whys).\n",
    "\n",
    "Some ideas (of course, please come up with your own as well):\n",
    "\n",
    "- Garbage in, garbage out. If you work with dirty data, even the most\n",
    "sophisticated models won’t be able to get satisfying results. Better\n",
    "data beats fancier algorithms.\n",
    "- To create a clean dataset (so that it has good enough accuracy\n",
    "and correctness)\n",
    "- So that we can create models that are closer to nature’s model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "4UtaBjaqq-v7"
   },
   "source": [
    "#### Load Your Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Archive:  boston_housing_dataset.zip\n",
      "  inflating: column.header           \n",
      "  inflating: housing-unclean.csv     \n",
      "  inflating: housing.csv             \n",
      "  inflating: housing.names           \n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "if [[ ! -s boston_housing_dataset.zip ]]; then\n",
    "    curl -O -L https://github.com/neomatrix369/awesome-ai-ml-dl/releases/download/v0.1/boston_housing_dataset.zip\n",
    "fi\n",
    "\n",
    "unzip -o boston_housing_dataset.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 967
    },
    "colab_type": "code",
    "id": "p0JJ2OzsrECo",
    "outputId": "70b89c5c-ec3a-4af0-c1ff-f1f1c8610e8a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Names and descriptions of the fields of the Boston Housing dataset can be found at\n",
      "https://github.com/jbrownlee/Datasets/blob/master/housing.names\n",
      "\n",
      "1. Title: Boston Housing Data\n",
      "\n",
      "2. Sources:\n",
      "   (a) Origin:  This dataset was taken from the StatLib library which is\n",
      "                maintained at Carnegie Mellon University.\n",
      "   (b) Creator:  Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the \n",
      "                 demand for clean air', J. Environ. Economics & Management,\n",
      "                 vol.5, 81-102, 1978.\n",
      "   (c) Date: July 7, 1993\n",
      "\n",
      "3. Past Usage:\n",
      "   -   Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, \n",
      "       1980.   N.B. Various transformations are used in the table on\n",
      "       pages 244-261.\n",
      "    -  Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning.\n",
      "       In Proceedings on the Tenth International Conference of Machine \n",
      "       Learning, 236-243, University of Massachusetts, Amherst. Morgan\n",
      "       Kaufmann.\n",
      "\n",
      "4. Relevant Information:\n",
      "\n",
      "   Concerns housing values in suburbs of Boston.\n",
      "\n",
      "5. Number of Instances: 506\n",
      "\n",
      "6. Number of Attributes: 13 continuous attributes (including \"class\"\n",
      "                         attribute \"MEDV\"), 1 binary-valued attribute.\n",
      "\n",
      "7. Attribute Information:\n",
      "\n",
      "    1. CRIM      per capita crime rate by town\n",
      "    2. ZN        proportion of residential land zoned for lots over \n",
      "                 25,000 sq.ft.\n",
      "    3. INDUS     proportion of non-retail business acres per town\n",
      "    4. CHAS      Charles River dummy variable (= 1 if tract bounds \n",
      "                 river; 0 otherwise)\n",
      "    5. NOX       nitric oxides concentration (parts per 10 million)\n",
      "    6. RM        average number of rooms per dwelling\n",
      "    7. AGE       proportion of owner-occupied units built prior to 1940\n",
      "    8. DIS       weighted distances to five Boston employment centres\n",
      "    9. RAD       index of accessibility to radial highways\n",
      "    10. TAX      full-value property-tax rate per $10,000\n",
      "    11. PTRATIO  pupil-teacher ratio by town\n",
      "    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks \n",
      "                 by town\n",
      "    13. LSTAT    % lower status of the population\n",
      "    14. MEDV     Median value of owner-occupied homes in $1000's\n",
      "\n",
      "8. Missing Attribute Values:  None.\n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import random\n",
    "\n",
    "sns.set()\n",
    "\n",
    "# Read the column names, closing the file once they are loaded\n",
    "with open(\"column.header\", \"r\") as header_file:\n",
    "    names = [line.strip() for line in header_file]\n",
    "data = pd.read_csv(\"housing-unclean.csv\", names=names)\n",
    "\n",
    "print(\"Names and descriptions of the fields of the Boston Housing dataset can be found at\")\n",
    "print(\"https://github.com/jbrownlee/Datasets/blob/master/housing.names\")\n",
    "print(\"\")\n",
    "!cat housing.names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "qIhcY8dYE6QZ"
   },
   "source": [
    "### Data Cleaning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "pMktiylmHncM"
   },
   "source": [
    "- deal with errors\n",
    "- deal with duplicates\n",
    "- deal with outliers\n",
    "- deal with missing data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "OH3OVpoBJRtl"
   },
   "source": [
    "#### Deal with errors\n",
    "\n",
    "Also known as structural errors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Tyd1ZzUre9lj"
   },
   "source": [
    "| Type of problem | Technique to use |\n",
    "|-----------------------------|---------------------------|\n",
    "| mislabelled | relabel data automatically or manually |\n",
    "| dataset standardisation issue | uniformly replace them |\n",
    "| sync issues between sources of data | standardise the data |"
   ]
  },
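  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the techniques in the table above, the snippet below fixes mislabelled and inconsistently written values in a small, made-up DataFrame. The `city` column, its values and the `label_fixes` mapping are illustrative assumptions and are not part of the Boston Housing dataset.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample with structural errors: stray whitespace,\n",
    "# inconsistent capitalisation and an abbreviation used as a label\n",
    "sample = pd.DataFrame({\"city\": [\" Boston\", \"boston\", \"BOSTON \", \"Cambridge\", \"Camb.\"]})\n",
    "\n",
    "# Standardise the representation: trim whitespace, use one casing convention\n",
    "sample[\"city\"] = sample[\"city\"].str.strip().str.title()\n",
    "\n",
    "# Relabel known mislabelled values via an explicit mapping (assumed for this example)\n",
    "label_fixes = {\"Camb.\": \"Cambridge\"}\n",
    "sample[\"city\"] = sample[\"city\"].replace(label_fixes)\n",
    "\n",
    "print(sample[\"city\"].value_counts())\n",
    "```\n",
    "\n",
    "An explicit mapping keeps every relabelling decision visible and reviewable, which helps when the fixes need to be agreed with a domain expert."
   ]
  },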
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Mm5EsDitJXgQ"
   },
   "source": [
    "#### Deal with duplicates\n",
    "\n",
    "Get stats on the number of non-unique (duplicate) rows in the dataset and decide whether to delete them. In most cases you would delete them.\n",
    "\n"
   ]
  },
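  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before deciding whether to delete, it can help to look at the duplicated rows themselves. The sketch below uses a small, made-up DataFrame (the `sample` frame and its `rm` / `medv` values are assumptions for illustration, not the real dataset) to show how `keep=False` marks every copy of a duplicated row, not only the later repeats.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample: rows 0 and 2 are exact copies of each other\n",
    "sample = pd.DataFrame({\"rm\": [6.5, 7.1, 6.5, 5.9], \"medv\": [24.0, 34.7, 24.0, 18.2]})\n",
    "\n",
    "# Default keep='first': only the later repeats are flagged\n",
    "repeats_count = sample.duplicated().sum()\n",
    "\n",
    "# keep=False: every member of a duplicate group is flagged, useful for inspection\n",
    "all_copies = sample[sample.duplicated(keep=False)]\n",
    "\n",
    "# drop_duplicates keeps the first occurrence of each group by default\n",
    "deduplicated = sample.drop_duplicates()\n",
    "print(repeats_count, len(all_copies), len(deduplicated))\n",
    "```\n",
    "\n",
    "Inspecting `all_copies` first makes it easier to judge whether the repeats are genuine duplicates or legitimate records that happen to share the same values."
   ]
  },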
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 2201
    },
    "colab_type": "code",
    "id": "WjnQpQCFJq4G",
    "outputId": "e38c4176-5464-4a7e-c46c-ac1dc19ebc76"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset rows count BEFORE dropping duplicates: 606\n",
      "Duplicated rows count: 50\n",
      "% of duplicated rows to total rows in the dataset: 8.25082508250825\n",
      "\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAELCAYAAAA/cjqaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAHBFJREFUeJzt3XmYXFWd//F3dwJJJA1I00HDlhmcfEQMIAgBRUGWQRBER+UhmgT5DWoU4XFDQEAYRjEKjMjiJII/VgFFBQXxpxJZDJsMEiEwflkkEPZOA5IAYUn3749zWiqxu6u609VdnfN5PU8/XXXPXU7dU3U/99x7q25TV1cXZmZWpubhroCZmQ0fh4CZWcEcAmZmBXMImJkVzCFgZlYwh4CZWcEcArbGkDRH0vF9lJ8o6eLVmP/1kg4d6PQV85kkqUvS6Pz815IOXt35DkK9uiS9ZbjrYUNr9HBXwOpD0iJgI+A1YAVwL3Ah8IOI6Kxh+knAQ8BaEfFaHes5aMuJiFkV890NuDgiNlmtCg6BiNhndech6ZPAoRGxy+rXyErinsCabf+IaAE2B2YDRwE/HN4q2UgnqUmStx1rCPcEChARfwN+KelJ4FZJp0XEQkkfAL4BbAH8DfhhRJyYJ7sx/39OEsBewNPAOcA2QBfwG+CwiHgOQNJRwBHAusDjwOciYl7eYHwV+BSwPjAPmBURz/S0nIi4pbvuksYCzwKbRsQSSccC/wFsEBHPS/pPoCUiviDpfOBR4FvAr4ExkpblWU3O/9eWdCHwYeAR4OCI+J+e1pukvYAzgTcDFwFNFWUnAm+JiOn5+SQqejSSrgduAfYA3gpcBxySX/Oqy7me1Gs5Nz//FPAlYBNgMTA9Iv4k6ei8Difk4cdGxBWStgTmAGvl1/taRKwvaQzwTeBAYAxwBfDFiHgpL+fIvJwu4Lie1sEqdbwJ2A3YDpgi6cW83F2AZ4BvR8Q5/WyzfYFTgU2B54HvRsSpfdXFBpfTvCAR8UfSRvI9edALwEzShvkDwGclfSiXvTf/Xz8ixucNcxNpAzsR2JL0wT0RQGkL/nlgh9z72BtYlOdxOPAhYNc87bPA2X0sp7LOy4Hb87Tk/w8D7654fsMq07wA7AM8nuc5PiIez8UfBC7Lr/mXwFk9rStJGwI/J20cNwQerFhmrWYC/4cUIq8BZ1SbQNLHSOt0JilMPwh05OIHSW23HmmjerGkN0fE/wKzgFvya10/jz+bFH7bAm8BNga+npfzfuArpHD/F2DPGl7PDODTQAupDS4jvZ8mAh8FTpa0ez/b7IfAZ/J75u3A72uohw0i9wTK8ziwAUBEXF8x/C5Jl5I+oFf2NGFEPAA8kJ+2S/ov4IT8fAVpb/NtktojYlHFpLOAz0fEo/D3vehHJM2osc43ALtK+gWwNSmIdpV0HbADr/cmajE/Iq7J9bgI+EIv4+0L3BMRP83jng58uR/LAbgoIhbm6Y8HFtRwAvhQ4DsRcXt+3r2+iYjLK8b7saRjgB2BX6w6E0lNpA321t29D0knA5cAx5B6B+dV1O9EYFqVup0fEffk8TclbdQ/kDf6CySdSwqv31N7m71Kes/8OSKeJe0g2BByCJRnY1LXHUlTSXuLbwfWJm3EL+9tQkkbAd8j7Y22kHqSz0IKCElfIO3FbiXpN8CX8h745sAVkipPSK8gnbiuxQ3Af5EOQ9wN/I60B7kT8EBEdPQx7aqerHj8IjBW0ugeTkpPJB1yASAiuiQtpn8qx38YWIvUq+jLpqQ9/n8gaSbp8M2kPGh8H/NrA94A3JEPs0HqyY3KjycCd6xSv2oqX89E4JmIWLrKPN6ZH9faZh8h9bZmS7oLOHrV3qDVlw8HFUTSDqQQmJ8HXUI6JLJpRKxHOr7bfdy7p5+XPTkPnxIR6wLTK8YnIi7JV6dsnsf7di5aDOwTEetX/I2NiMd6Wc6qbgZEOo5/
Q0TcC2xG2lu/oZdpVvfncZ8gbZCBv+9Zb1pR/gJpI9vtTT3Mo3L8zUh7vUuqLHcx6RzNSiRtTjof83mgNR/yWUjv7bUEeAnYqmKdrxcR43P5Sq8v16+aymU8DmwgqWWVeTyWH9fUZhFxe0QcQDrPcSXwkxrqYYPIIVAASetK2o90DPfiiLg7F7WQ9uaWS9oR+HjFZO1AJ/DPFcNagGXA3yRtDBxZsQxJ2j2fjFxO2gB17/nPAb6ZN2RIapN0QB/LWUlEvEjaaz2M1zcgN5MOM/UWAk8BrZLW622+VfyK1KP5t3w9/xGsvKFfALxX0mZ5Gcf0MI/pkt4m6Q3AScBPI2JFleWeC3xF0vb5Kpy35PW2Dmkj3A4g6RBSD67bU8AmktYGyJcBnwN8V9KEPM3GkvbO4/8E+GRF/U6gHyJiMakNviVprKStgX8HLs7lVdtM0tqSPiFpvYh4lXRiuOrlyza4HAJrtqskLSVfSULqnh9SUf454KQ8ztep2AvLH+JvAjdJek7STqSTkduRriT6FenEabcxpENLS0iHXCbw+obxe6Qex2/zsm4FpvaxnJ7cQDqc8seK5y30cj4gIv4CXAr8Nc93Ym8rqZfplwAfy6+pg3Ty9KaK8t8BPwbuIm3sru5hNhcB55PWx1hSkFRb7uWk9XEJsJS0d7xB3pM+jXTF0VPAlMr6kI7D3wM8Kam7t3EU6ZzCrZKeB64l7Z0TEb8GTs/TPcDATshOIx2aepx05dEJEXFtRXktbTYDWJTrNwv4xADqYauhyTeVMRt8q172adao3BMwMyuYQ8DMrGA+HGRmVjD3BMzMCtaoXxYbQ/pW4ROkLxWZmVl1o0g/U3I78HItEzRqCOwA/GG4K2FmNkK9h9e/FNqnRg2BJwCeffYFOjvX7HMWra3j6ehYVn1Eaxhus5GnlDZrbm7ijW9cB/I2tBaNGgIrADo7u9b4EACKeI1rGrfZyFNYm9V8GN0nhs3MCuYQMDMrmEPAzKxgDgEzs4I5BMzMCuYQMDMrWE2XiEoaC3yXdDPq5aQbWn9a0mTgAqCV9JvrMyPi/jxNr2VmZtYYav2ewHdIG//J+V6r3feGnQOcHREXS5oOzAV2r6GsrlrWHcfYMY36FYh/1NbWUn2kBrH85ddY+vxLw10NMxskVbeUksYDM4FNIqILICKeyres2w7YK496KXCWpDbSfU97LIuI9kF+Df9g7JjR7P/lX9R7MUW66rQDWFp9NDMbIWrZXd6CdDjnBEnvI91j9jjSPWQf675nakSskPQ46ebVTX2U1RwCra3jq49kQ24k9Vzqxetg5HGb9ayWEBhFugn4nRFxpKSpwFWk+6/WVUfHsgF91duNXV/t7WX3BdraWopfByNNKW3W3NzU753nWq4OegR4jXRIh4i4jXQz8ZeAjSWNAsj/J5Juar64jzIzM2sQVUMgIpYA15GP7+erfiYA9wELgGl51Gmk3kJ7RDzdW9ngVt/MzFZHrd8TmAV8TdLdwGXAjIh4Lg8/XNJ9wOH5eeU0vZWZmVkDqOk6yoj4K7BbD8P/AkztZZpey8zMrDH4G8NmZgVzCJiZFcwhYGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWMIeAmVnBHAJmZgUbOXdeMbOGMdJu3AQj59eFh/rGTSOrFc2sIfjGTfUz1Ddu8uEgM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzApW009JS1oELM9/AEdFxG8k7QTMBcYBi4DpEfF0nqbXMjMzawz96Ql8NCK2zX+/kdQMXAwcFhGTgRuB2QB9lZmZWeNYncNB2wPLI2J+fj4HOLCGMjMzaxD9ubPYjyQ1AfOBrwGbAQ93F0bEEknNkjboqywinql1ga2t4/tRPRsqI+U2ffXkdWD1NJTvr1pD4D0RsVjSGOB04CzgivpVK+noWEZnZ1e/p/MH
tL7a24fy5neNp62txevAn7G6Guj7q7m5qd87zzUdDoqIxfn/y8D3gXcDjwCbd48jaUOgM+/p91VmZmYNomoISFpH0nr5cRNwELAAuAMYJ2mXPOos4PL8uK8yMzNrELUcDtoI+JmkUcAo4F7gcxHRKWkGMFfSWPJloAB9lZmZWeOoGgIR8VfgHb2U3QxM6W+ZmZk1Bn9j2MysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYKP7M7KkE4ATgSkRsVDSTsBcYBywCJgeEU/ncXstMzOzxlBzT0DSdsBOwMP5eTNwMXBYREwGbgRmVyszM7PGUVMISBoDnA18tmLw9sDyiJifn88BDqyhzMzMGkStPYGTgIsjYlHFsM3IvQKAiFgCNEvaoEqZmZk1iKrnBCTtDLwTOLr+1VlZa+v4oV6k1aCtrWW4qzDsvA6snoby/VXLieFdgS2BhyQBbAL8BjgD2Lx7JEkbAp0R8YykR3or60/lOjqW0dnZ1Z9JAH9A6629felwV2FYtbW1eB34M1ZXA31/NTc39XvnuerhoIiYHRETI2JSREwCHgX2Bk4BxknaJY86C7g8P76jjzIzM2sQA/6eQER0AjOA/5Z0P6nHcHS1MjMzaxz9+p4AQO4NdD++GZjSy3i9lpmZWWPwN4bNzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzAo2upaRJF0J/BPQCSwDDo+IBZImAxcArUAHMDMi7s/T9FpmZmaNodaewMERsU1EvAM4Ffi/efgc4OyImAycDcytmKavMjMzawA1hUBE/K3i6XpAp6QJwHbApXn4pcB2ktr6KhucapuZ2WCo+ZyApHMlPQJ8EzgY2BR4LCJWAOT/j+fhfZWZmVmDqOmcAEBEHAogaQZwCnB8vSrVrbV1fL0XYQPQ1tYy3FUYdl4HVk9D+f6qOQS6RcRFkn4APApsLGlURKyQNAqYCCwGmvooq1lHxzI6O7v6W0V/QOusvX3pcFdhWLW1tXgd+DNWVwN9fzU3N/V757nq4SBJ4yVtWvF8f+AZ4GlgATAtF00D7oyI9ojotaxftTMzs7qqpSewDnC5pHWAFaQA2D8iuiTNAi6Q9HXgWWBmxXR9lZmZWQOoGgIR8RSwUy9lfwGm9rfMzMwag78xbGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWsNHVRpDUClwEbAG8AtwPfCYi2iXtBMwFxgGLgOkR8XSertcyMzNrDLX0BLqA70SEImIK8CAwW1IzcDFwWERMBm4EZgP0VWZmZo2jaghExDMRcX3FoFuBzYHtgeURMT8PnwMcmB/3VWZmZg2iX+cE8h7+Z4FfApsBD3eXRcQSoFnSBlXKzMysQVQ9J7CKM4FlwFnAhwe/OitrbR1f70XYALS1tQx3FYad14HV01C+v2oOAUmnAv8C7B8RnZIeIR0W6i7fEOiMiGf6KutP5To6ltHZ2dWfSQB/QOutvX3pcFdhWLW1tXgd+DNWVwN9fzU3N/V757mmw0GSTiYd5/9QRLycB98BjJO0S34+C7i8hjIzM2sQtVwiuhVwDHAfcLMkgIci4sOSZgBzJY0lXwYKkHsK
PZaZmVnjqBoCEXEP0NRL2c3AlP6WmZlZY/A3hs3MCuYQMDMrmEPAzKxgDgEzs4I5BMzMCuYQMDMrmEPAzKxgDgEzs4I5BMzMCuYQMDMrmEPAzKxgDgEzs4I5BMzMCuYQMDMrmEPAzKxgDgEzs4I5BMzMCuYQMDMrmEPAzKxgDgEzs4I5BMzMCuYQMDMrmEPAzKxgDgEzs4I5BMzMCja62giSTgU+AkwCpkTEwjx8MnAB0Ap0ADMj4v5qZWZm1jhq6QlcCbwXeHiV4XOAsyNiMnA2MLfGMjMzaxBVQyAi5kfE4sphkiYA2wGX5kGXAttJauurbPCqbWZmg2Gg5wQ2BR6LiBUA+f/jeXhfZWZm1kCqnhMYTq2t44e7CtaDtraW4a7CsPM6sHoayvfXQENgMbCxpFERsULSKGBiHt7UR1m/dHQso7Ozq9+V8we0vtrblw53FYZVW1uL14E/Y3U10PdXc3NTv3eeB3Q4KCKeBhYA0/KgacCdEdHeV9lAlmVmZvVTNQQknSHpUWAT4FpJ9+SiWcDhku4DDs/PqaHMzMwaRNXDQRFxBHBED8P/AkztZZpey8zMrHH4G8NmZgVzCJiZFcwhYGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWMIeAmVnBGvqmMlaOlnXHMXbMyHk7jqTf01/+8mssff6l4a6GNaiR86mzNdrYMaPZ/8u/GO5qrJGuOu0Ayr4FjvXFh4PMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OCOQTMzArmEDAzK5hDwMysYA4BM7OC1fUH5CRNBi4AWoEOYGZE3F/PZZqZWe3q3ROYA5wdEZOBs4G5dV6emZn1Q916ApImANsBe+VBlwJnSWqLiPYqk48CaG5uGvDyJ7xx3ICntb6tTrv0xW1WP/VoM7dX/Qy0vSqmG1XrNE1dXV0DWlg1krYHLoyIrSqG3QtMj4g/VZl8F+APdamYmdma7z3A/FpGbNSbytxOehFPACuGuS5mZiPFKODNpG1oTeoZAouBjSWNiogVkkYBE/Pwal6mxhQzM7OVPNifket2YjgingYWANPyoGnAnTWcDzAzsyFSt3MCAJLeSrpE9I3As6RLRKNuCzQzs36pawiYmVlj8zeGzcwK5hAwMyuYQ8DMrGAOATOzgjXql8WGnKTbgDHA2sBkYGEuujMiDuljut2B5oi4toZlHArsGREHDUKV11iSFgHLSd8XWQe4B/h2RNy8mvM9H/ifiDhL0ixgXER8d4Dz2g1YOyJ+uzp1WtNJ6gJaImJZxbAlwDsjYlGVaa8BDo+Ifl33bv3jEMgiYiqApEmkDcW2NU66O2k9Vg2BWkkaHRGvDdb8RqiPRsRCAEn/Blwjae+IuG0wZh4Rc1ZzFrsB44HVCgG3de8iYt/hWK6kZqArIoq4dNIhUCNJXwM+np/eBhxO6jEcCjRJej/wI+AM4CrSz2ePBW4FZkXEq1XmP5/0Ve+dgXZgf0mHAF8CuoD783zaJd0OfDoi7pT0A2BqRGwjaS3gSWATYHvgTKCJ1M4nRcRPBmdtDK2I+LmkHYGvAB+r3KOHf9jDPx94FdgK2BC4ATgsIl6pnKekE4HxEfGV/PwYUvt2Ai+Qfr9qAumHD9clteWvIuKrkqYAs4BmSXsCl0XEbEn7AsfmcV8BvhgRt676eiRdT/oi5U7AM8C+kmYCR5La+kHgMxHxtKRbgCMi4nZJ3wd2jYitJI0mtfXmwDbAWaTDu2sB34iISwe2todW7vVdSPqhyTcDp1a06yJgv4hYKOltwHmk4L0bmER6nVdXjtfDdAJOJ70X1gZOj4jzeqjHiaT3zHrAZsDOkt5C+jyvQ3pPdLfDt4BnIuIUSQcClwFvyu11TV7eAuASYKO8iGsj4ourv8YGn88J1EDS/sBBpA30FNKH/NiIWACcC5wXEdtGxCmkDdBBEbF9HncccHCNi5oE
vDsi9pe0DfCfwF4RsTVwH+nNBTAP2CM/fhewQlIbaaNyV0S8BBwDnJx7NFPIe6ySPixpdfeCh8NtpA9pLaYC/wq8jbSR/HRfI0s6GPgg8K6I2AbYPyI6gefy4+2BbYF3Snp/RNxN+pn0C3O7z5a0BXA8sE8e/1Cgr9D9Z2CXiNhX0tuB2cC/5rZeSApwWLmtdwFekvRmYAfgfyPiBeAo4JTc1m8Hfl3jemoUb4iInUm9q9mSxvcwzkXA9/MPUp5Oev19ykF5CSmMdyCtv6Pzl1h7MhX4eES8lbTR/xlwXG6T44GfSVqbldtkD9KO3u55J2wq6SdvPgE8GBFTImIKcFK1+g4Xh0Bt9gQuiYiluYt4Th7Wk2bgKEkLgD8Du5I2ILX4UUR0/2De7sDVEfFkfj63YpnzgD3zoasngGtIb8Y9gd/nca4Dvi7pWGCHiHgOICKuiIhZNdankfTnt3V/HBHL8mGWC0jrsi/7Af8dEUsBIqIjDx8FnCLpz8AdpA1sb225N7AFcGNu+x8BoyVt1Mv4l1QcBnofcE1EPJGf99TWm5JuzHQ1Pbf1cZKOA3bsbusGV3mo5TKAfI7gWVJP9u8krUta9xfl8W4l9QaqmQxsCVyW2+QPpPN+W/Yy/jURsaR7scArETEvL/NaUu9OwE3ADjkQ3k3awO9J2glbGBEvkoJhH0mnSNoPWEaDcggMvhnAjqS9vCmkD/TYGqet9Y0yPy9jX9JGonvPZI/8mIg4FfgwacPx/dzdHcl24PWT9a+x8nu31vXbX18i/eTJ1Lw3eGUfy2oC/l/uGXT/TYyIp3oZv9a2vpl0X44P0Htbn07qybQDZ0r6Ro3zHgrtpEOjwN/3ztfLw7str3i8gv4fpu7t/dAELFmlTSZFxBW9zKemNsk97btIv4f2BCmEd2blNrkFeAdp52FGHqchOQRqcy1wkKTxkpqAfwd+l8ueJ72pu61PeuMtk/RGXv8Bvf76PbBfvjkPwKe6l1nxJvxqrtvNpK70lsAfASQpIh7IJ0DPJIXGiCTpAOCzwGl50APkwwH50Mj7VpnkY5LWyRucGby+x9ybq4HPSmrJ8+zeaK0PPBERyyVtDBxQMc2q7f5b4P2SKu+fUfWQRXYd6bzAm/LzyrZ+GfgTcDSprW8l7X1unR8jaXJEPBgRc4Hv0Vht/TvgMxXPPw3cmveWaxIRz5P2/D8OkM8PTakYpfL9sAevH4cP4EVJM7pHlPTW3LOoulhgbUnvy9PtTjrf0v3bZ/OA/wDm5TZ6FPhkHo6kfwKej4jLSDsT2+cTzg3HJ4ZrEBFX5ZOB3Sf5bgNOzo9/Bvy84hDAuaSTun8BngJupB93+alY5p8lHQ/My5fZPcDKH6Z5wBeBP0VEp6SHgPsqDjF8QdJ7SV3Yl4HDIJ0TAPYeAYeEfiqp+xLRe4F9K64MOieX30s6V7LqFUO3kzbKE4DrgR9UWdaFwMbArZJeBZbldXcGcLmkhaQP+byKaa4AZuZ27z4xPB34oaRxpJOQN1HD77rnE5hHA7/Lbf1X/rGtdwBuj/Sz7A8AD1Wc7D4ib6y62/rwasscQl8AvifpLtJJ98WkYO6vmcB5eT3dzcrr9XjgAkmHkwL/EYCIeC2fzztd0pGkz+FTwIHVFhYRr0j6CHCGpO4Twx+tWOfzSOfs5lU8fxd5J4y0U/YlSStIO9uz8nmmhuMfkLM1yqpXDtmaKV9hdWpEXD3cdRnpGrJ7YmZmQ8M9ATOzgrknYGZWMIeAmVnBHAJmZgVzCJiZFcwhYGZWMIeAmVnB/j8WYO0p5bxjtAAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "50 rows deleted\n",
      "Dataset rows count AFTER dropping duplicates: 556\n"
     ]
    }
   ],
   "source": [
    "total_rows_count = data.shape[0]\n",
    "print(\"Dataset rows count BEFORE dropping duplicates:\", total_rows_count)\n",
    "\n",
    "duplicated_rows_count = data[data.duplicated()].shape[0]\n",
    "print(\"Duplicated rows count:\", duplicated_rows_count)\n",
    "print(\"% of duplicated rows to total rows in the dataset:\", duplicated_rows_count / total_rows_count * 100)\n",
    "print()\n",
    "\n",
    "\n",
    "labels = (\"Total rows:\", \"Duplicate rows\", \"Unique rows\")\n",
    "values = (total_rows_count, duplicated_rows_count, total_rows_count - duplicated_rows_count)\n",
    "plt.title(\"Dataset with duplicated rows\")\n",
    "plt.bar(labels, values)\n",
    "plt.show()\n",
    "\n",
    "# Delete duplicates\n",
    "if duplicated_rows_count > 0:\n",
    "    data = data.drop_duplicates()\n",
    "    print(duplicated_rows_count, \"rows deleted\")\n",
    "\n",
    "# Check the dataset after deletion\n",
    "print(\"Dataset rows count AFTER dropping duplicates:\", data.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SEPARsYEJrnn"
   },
   "source": [
    "#### Deal with outliers [DEMO - WALKTHRU]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Normal distribution](https://www.mathsisfun.com/data/images/normal-distrubution-large.svg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 225
    },
    "colab_type": "code",
    "id": "icXl0KwkK7tD",
    "outputId": "d132dd24-4cc2-4c05-c59a-cf872759141d"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "count    556.000000\n",
      "mean       6.273680\n",
      "std        0.723658\n",
      "min        3.561000\n",
      "25%        5.876750\n",
      "50%        6.189000\n",
      "75%        6.618250\n",
      "max        8.780000\n",
      "Name: rm, dtype: float64\n",
      "Number of outliers in the Room column (< mean - 2 x std dev OR > mean + 2 x std dev): 36\n"
     ]
    }
   ],
   "source": [
    "print(\"\")\n",
    "print(data.describe()[\"rm\"])\n",
    "rm_column = data[\"rm\"]\n",
    "\n",
    "# Series.mean() and Series.std() return the same values that describe() reports\n",
    "mean = rm_column.mean()\n",
    "stddev = rm_column.std()\n",
    "two_times_stddev = 2 * stddev\n",
    "\n",
    "minimum = mean - two_times_stddev\n",
    "maximum = mean + two_times_stddev\n",
    "\n",
    "num_of_outliers_rm_col = rm_column[(rm_column < minimum) | (rm_column > maximum)].count()\n",
    "print(\"Number of outliers in the Room column (< mean - 2 x std dev OR > mean + 2 x std dev):\", num_of_outliers_rm_col)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Distribution with all values\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEBCAYAAAB2RW6SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEbhJREFUeJzt3X9sXXd5x/G37bR21Dj94TqFlP4Y3fIMoQxIqVpGOwbil7R1HRsrRLQBoYmlIPgHTcAElLEJVUAFY01JpAmpIyjTOlhLtUlFSCAIZQhYK9YhnnbQlLR0reMUaFCTtrb3xz2WstD4/jw+19/7fkmRfc+995zn63vvxyePv+ecsaWlJSRJ5RpvugBJUr0MekkqnEEvSYUz6CWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFW9fgtieBS4BHgIUG65CktWQCeC7wXeBYJ09oMugvAb7Z4PYlaS27AtjfyQObDPpHAB5//FcsLg7vGTRnZjYwP3+k6TJq5zjLMirjhNEZ6/I4x8fHOPPM06DK0E40GfQLAIuLS0Md9MDQ1zcojrMsozJOGJ2xnjDOjlve/jFWkgpn0EtS4Qx6SSqcQS9JhTPoJalwBr0kFc6gl6TCdTSPPiKmgE8BrwaOAt/OzHdExBbgFmAGmAd2ZOb9dRUrdWN643qmJn/9LT47O83RY8/wxC+fbKAqafV1esDUx2kF/JbMXIqIc6rlu4Fdmbk3Iq4B9gCvqqFOqWtTk+u48r23P+t9d9x4FU+scj1SU9q2biJiA7AD+FBmLgFk5qMRsQnYBuyrHroP2BYRs3UVK0nqXid79BfRastcHxGvBI4AHwSeBB7OzAWAzFyIiJ8B5wFzNdUrSepSJ0E/ATwfuDsz/zIiLgXuAP5sEAXMzGwYxGpqNTs73XQJq2JUxrms9PGWPr7jjcpYex1nJ0H/U+AZqhZNZn4nIg7R2qM/NyImqr35CWAzcLCbAubnjwz1CYlmZ6eZmyu/m1viONt9KEob7/FKfD1PZlTGujzO8fGxrneQ2/boM/MQ8DXgNQDVTJtNwH3APcD26qHbae3127aRpCHS6Tz6ncBfRcR/Af8EXJuZP6+Wvzsi7gPeXd2WJA2RjqZXZuZPgN9/luU/Ai4dcE2SpAHyyFhJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4To9H71UlKeeXljxXDhemEQlMeg1kk49ZeKkFyUBL0yisti6kaTCGfSSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBXOA6a0Zk1vXM/UZDNv4ZW27VG1GjYGvdasqcl1bY9ubWLbHlWrYWPrRpIKZ9BLUuEMekkqnEEvSYUz6CWpcB3NuomIA8DR6h/A+zLzzoi4DNgDrAcOANdk5mODL1OS1Ktuple+MTPvXb4REePAXuBtmbk/Ij4I3AC8fcA1SpL60E/r5mLgaGbur27vBq7uvyRJ0iB1E/RfiIgfRMTNEXEGcD7w4PKdmXkIGI+IswZdpCSpd522bq7IzIMRMQl8GrgJ+NdBFDAzs2EQq6nVSheRLsmojLNT/fw8huFnOQw1rJZRGWuv4+wo6DPzYPX1WETcDHwZ+DvgguXHRMTZwGJmHu6mgPn5IywuLnXzlFU1OzvN3Fz5B7SvxXHW/eFe6efRbttN/yzX4uvZq1EZ6/I4x8fHut5Bbtu6iYjTIuL06vsx4M3APcD3gfURcXn10J3ArV1tXZJUu0726M8BvhgRE8AE8EPgnZm5GBHXAnsiYopqemVtlUqSetI26DPzJ8BLTnLfXcDWQRclSRocj4yVpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwBr0kFc6gl6TCGfSSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJ
KpxBL0mFW9d0AdIweurpBWZnp5suQxoIg156FqeeMsGV7739pPffceNVq1iN1B9bN5JUuK726CPieuAjwNbMvDciLgP2AOuBA8A1mfnYoIuUJPWu4z36iNgGXAY8WN0eB/YC78rMLcA3gBvqKFKS1LuOgj4iJoFdwHXHLb4YOJqZ+6vbu4GrB1ueJKlfne7RfxTYm5kHjlt2PtXePUBmHgLGI+KswZUnSepX2x59RLwMeCnw/joKmJnZUMdqB2pUptmNyjhXwzD8LIehhtUyKmPtdZyd/DH2FcALgAciAuB5wJ3AZ4ALlh8UEWcDi5l5uJsC5uePsLi41M1TVtXs7DRzc080XUbt1uI4h/nD3fTPci2+nr0albEuj3N8fKzrHeS2rZvMvCEzN2fmhZl5IfAQ8DrgE8D6iLi8euhO4NbuSpck1a3nefSZuQhcC3w2Iu6ntedfS3tHktS7ro+Mrfbql7+/C9g6yIIkSYPlkbGSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBXOoJekwnkpQWnA2l1v9uixZ3jil0+uYkUadQa9NGCdXG+2/FNwaZjYupGkwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwHZ2PPiJuA34DWASOAO/OzHsiYgtwCzADzAM7MvP+uoqVJHWv0z36t2bmizLzJcAngc9Vy3cDuzJzC7AL2FNDjZKkPnQU9Jn5i+Nung4sRsQmYBuwr1q+D9gWEbODLVGS1I+OLyUYEf8AvBYYA14PnAc8nJkLAJm5EBE/q5bPdbremZkNXRXchJWu/1mSURln09pdU/appxc49ZSJvrczSq/nqIy113F2HPSZ+ecAEXEt8AngQz1t8QTz80dYXFwaxKpqMTs7zdxc+Vf4XIvjXKsf7k6uKdvva7EWX89ejcpYl8c5Pj7W9Q5y17NuMvPzwCuBh4BzI2ICoPq6GTjY7TolSfVpG/QRsSEizjvu9pXAYeAx4B5ge3XXduDuzOy4bSNJql8nrZvTgFsj4jRggVbIX5mZSxGxE7glIj4MPA7sqK9USVIv2gZ9Zj4KXHaS+34EXDrooiRJg+ORsZJUOINekgpn0EtS4Qx6SSqcQS9Jhev4yFipDtMb1zM1efK34dFjz/DEL59cxYqk8hj0atTU5Lq2pwMo/+B2qV62biSpcAa9JBXOoJekwhn0klQ4/xirodbuIh2S2jPoNdRWukjHHTdetcrVSGuTrRtJKpxBL0mFM+glqXAGvSQVzqCXpMI560YaMitNKfUkb+qFQS8NmXZTSj3Jm7pl60aSCmfQS1LhDHpJKpxBL0mFM+glqXBtZ91ExAzweeAi4CngfuAvMnMuIi4D9gDrgQPANZn5WH3lSpK61cke/RLw8cyMzNwK/Bi4ISLGgb3AuzJzC/AN4Ib6SpUk9aJt0Gfm4cz8+nGL/gO4ALgYOJqZ+6vlu4GrB16hJKkvXR0wVe3FXwd8GTgfeHD5vsw8FBHjEXFWZh7udJ0zMxu6KaERo3Lhi1EZ51rX6es0Sq/nqIy113F2e2Ts3wNHgJuAN/S0xRPMzx9hcXFpEKuqxezsNHNz5R+L2NQ4R+UDOkidvE6j8r6F0Rnr8jjHx8e63kHueNZNRHwS+C3gTZm5CPyUVgtn+f6zgcVu9uYlSfXrKOgj4mO0evJ/nJnHqsXfB9ZHxOXV7Z3ArYMvUZLUj06mV74Q+ABwH3BXRAA8kJlviIhrgT0RMUU1vbLGWiW1Mb1xPVOTrY/1iW0xz3w5utoGfWb+NzB2kvvuArYOuihJvZmaXOeZL/VrPDJWkgrn+eilNWSli5JIJ2PQS2vIShclgVZ7RjqRrRtJKpxBL0mFM+glqXAGvSQVzqCXpMI560a1Ov5ITUnN8BOoWq10pCY4HVBaDbZuJKlwBr0kFc6gl6TCGfSSVDiDXpIKZ9BLUuEMekkq
nEEvSYXzgClpRLS7aInXlC2XQS+NiE4uWuI1Zctk60aSCmfQS1LhDHpJKpxBL0mFM+glqXBtZ91ExCeBPwUuBLZm5r3V8i3ALcAMMA/syMz76ytVktSLTvbobwN+D3jwhOW7gV2ZuQXYBewZcG2SpAFoG/SZuT8zDx6/LCI2AduAfdWifcC2iJgdfImSpH70esDUecDDmbkAkJkLEfGzavlcNyuamdnQYwmrZ6WjCUsyKuPUya3V98BarbtbvY6z8SNj5+ePsLi41HQZJzU7O83cXPnHC9Y1zlH5AJZiLb7XR+0zOj4+1vUOcq+zbg4C50bEBED1dXO1XJI0RHoK+sx8DLgH2F4t2g7cnZldtW0kSfXrZHrlZ4A/AZ4DfDUi5jPzhcBO4JaI+DDwOLCj1kol1cqzW5arbdBn5nuA9zzL8h8Bl9ZRlKTV59kty+WRsZJUuMZn3Wh1TG9cz9TkyV/up55eqG3dKsNKrR3bOsPNT+eImJpc1/a/5XWsu5/1aris1NqxrTPcbN1IUuEMekkqnK0bSY1q9zce+//9M+glNaqTvx/Z/++PrRtJKpx79AKcOqfRtFLbqKT3vUEvwKlzGk3tpgaX8r63dSNJhXOPXlLtPHq6Wf7kJdXOo6ebZetGkgpn0EtS4WzdSBpq7S6I0s+ZV0eFQS9pqHVyQRStzNaNJBXOPfoueQImabgM61Hdw3TUrUHfJU/AJA2XYT2qe5iOurV1I0mFM+glqXBrtnUzir3ypnp+7aa3ScP6HhnWulbbmg36UeyVN9Xzc3qb2hnW98iw1rXabN1IUuH63qOPiC3ALcAMMA/syMz7+11vnZps+6y07WNPLTB56kQt25XUnXZtn7XUHh5E62Y3sCsz90bENcAe4FUDWG9tmmz7tGu/+N9MaTh00vZZK+3hvoI+IjYB24DXVIv2ATdFxGxmzrV5+gTA+PhYz9vfdOb6Fe9fad3dPPfE9fSz3XbPb7fule7vZ7t11tXPc0exrjrXbV2DfW4/n7lesm98fOz453X83/+xpaWlrje2LCIuBv4xM1943LIfAtdk5n+2efrlwDd73rgkjbYrgP2dPLDJWTffpVXoI4Cnn5OkzkwAz6WVoR3pN+gPAudGxERmLkTEBLC5Wt7OMTr8bSRJ+n9+3M2D+5pemZmPAfcA26tF24G7O+jPS5JWSV89eoCI+G1a0yvPBB6nNb0yB1CbJGkA+g56SdJw88hYSSqcQS9JhTPoJalwBr0kFW7NnqZ4tUTE9cBHgK2ZeW/D5QxcRBwAjlb/AN6XmXc2VlBNImIK+BTwalpj/XZmvqPZqgYvIi4Ebjtu0RnAxsw8q5mK6hMRfwj8DTBW/fvrzPxSs1UNXkT8Aa1xngIcBt6WmQ90sw6DfgURsQ24DHiw6Vpq9sYSf4md4OO0An5LZi5FxDlNF1SHzDwAvHj5dkR8mgI/5xExBnweuCIz742I3wG+FRG3ZeZiw+UNTEScSWv6+u9m5n3ViSM/C7y+m/XYujmJiJgEdgHXNV2L+hMRG4AdwIcycwkgMx9ttqr6RcSpwFuAzzVdS00WgdOr788AHikp5Cu/CTyamfdVt/8deF1EnN3NSgz6k/sosLfaQyrdFyLiBxFxc0Sc0XQxNbiI1rUSro+I70XE1yPi8qaLWgV/BDzcwQkG15zqF/bVwO0R8SCtdtWOZquqxX3AcyLikur2W6qv53ezEoP+WUTEy4CXAjc3XcsquCIzXwRcQqvPeVPD9dRhAng+rdNzvBR4H/CliNjYbFm1ezuF7s1HxDrgA8BVmXkBcCXwz9X/3oqRmb8A3gR8KiK+B2wCfg480816DPpn9wrgBcAD1R8rnwfcGRGvbbKoOmTmwerrMVq/2F7ebEW1+CmtD8Y+gMz8DnAI2NJkUXWKiHNpvY+/0HQtNXkxsDkzvwVQff0Vrc9tUTLzq5l5ebWTchOwntU8qVmpMvOGzNycmRdm5oXAQ8DrMvMrDZc2UBFxWkSc
Xn0/BryZ1knqipKZh4CvUV0gp7r85Sbgf5qsq2ZvBf4tM+ebLqQmDwHPi4gAiIgXAOfQZQCuBRHxnOrrOPAxYHdm/qqbdRT313h15Rzgi9XppSeAHwLvbLak2uwEPhcRNwJPA9dm5s8brqlObwPe03QRdcnM/42I64B/iYjlP8C+PTMPN1lXTf42Il4OnAp8BXh/tyvwpGaSVDhbN5JUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwBr0kFc6gl6TC/R/ixTFOP9g/hAAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Distribution without outliers\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEBCAYAAAB2RW6SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEoxJREFUeJzt3WuMXGd9x/Hv7ia+NF5DsoyRYnKDxP9W1AUcUFIRREEUaKUQULlFJAahFgyIvChUFEoShFQUARGUxtRWUYrBkV+40KQpFbRIVCQNRQXiF4HmT7g4cRzA602o7Sq+7W5f7FlpZWzPmcvOznn2+5GsmTlzzpz/4zPz07PPuY3Mzs4iSSrX6FIXIElaXAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMKds4TrXgm8BPgFML2EdUhSk4xVj48CJ+sssJRB/xLgviVcvyQ12WXA3jozLmXQ/wLgqaf+j5mZwVxBc2JiDVNTRwayrkGyXc1iu5pnmNo2OjrC+eef19EySxn00wAzM7MDC/r59ZXIdjWL7WqeJrfNnbGSVLi2PfqIuBS4e8GkZwJrM/OCiNgA7AAmgClgc2Y+shiFSpK60zboM3Mv8ML51xHx2QXLbQO2ZubOiLgB2A68chHqlCR1qaOhm4hYAbwNuDMi1gGbgF3V27uATRHR6m+JkqRedLoz9nXA/sz8QURcWT2fBsjM6Yh4ArgImKz7gRMTazosoTet1vhA1zcotqtZbFfzNLltnQb9O4E7+1nA1NSRge3NbrXGmZw8PJB1DZLtahbb1TzD1LbR0ZGOO8i1h24iYj3wcuCuatI+YH1EjFXvjwEXVtMlSUOikx7924GvZeYUQGYeiIg9wPXAzurxwcysPWyj8o2vXc2qld2frjG+djWHDz3dx4qk5aeTX+A7gJtOmbYF2BERtwBPAZv7VJcKsWrlOVz7gXu6Xv7e269jOP5glpqrdtBn5obTTHsYuKqvFUmS+sozYyWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwBr0kFc6gl6TCGfSSVDiDXpIKd06dmSJiFfAZ4FXAUeA7mfmuiNgA7AAmgClgc2Y+sljFSpI6V7dH/0nmAn5DZm4Ebq6mbwO2ZuYGYCuwvf8lSpJ60TboI2INsBm4OTNnATLzVxGxDtgE7Kpm3QVsiojWYhUrSepcnaGb5zE3LHNrRLwCOAJ8FHga2J+Z0wCZOR0RTwAXAZN1C5iYWNNx0b1otcYHur5BKbVdUGbbSmwTlNsuaHbb6gT9GPBc4MHM/IuIuAq4F3hTPwqYmjrCzMxsPz6qrVZrnMnJwwNZ1yANc7v68eMY1rZ1a5i3Vy9KbRcMV9tGR0c67iDXGaN/DDhJNUSTmd8FDjLXo18fEWMA1eOFwL6OKpAkLaq2PfrMPBgR3wL+EPi36kibdcCPgT3A9cDO6vHBzKw9bCO1c/zEdNd/FRw9dpLDh57uc0VS89Q6vBLYAtwZEbcDJ4AbM/PXEbEF2BERtwBPMbfTVuqbFeeOce0H7ulq2Xtvv47h+GNbWlq1gj4zfwb8wWmmPwxc1eeaJEl95JmxklQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwBr0kFa7ujUekxunl7lTHjk+zcsVYV8t6ZysNG4Nexer17lTe2UqlcOhGkgpn0EtS4Qx6SSqcQS9Jhau1MzYi9gJHq38AH8rMb0TE1cB2YDWwF7ghMw/0v0xJUrc6OermjZn50PyLiBgFdgLvyMz7I+KjwG3AO/tcoySpB70M3VwJHM3M+6vX24A3916SJKmfOunR3xURI8D9
wEeAi4FH59/MzIMRMRoRF2Tmk3U/dGJiTQcl9K7bE2iGXantaqp226PU7VVqu6DZbasb9C/LzH0RsRL4LHAH8E/9KGBq6ggzM7P9+Ki2Wq1xJifLO5VlmNvV5B9HL862PYZ5e/Wi1HbBcLVtdHSk4w5yraGbzNxXPR4DPg+8FHgMuGR+noh4FjDTSW9ekrT42gZ9RJwXEc+ono8AbwX2AN8HVkfENdWsW4Ddi1WoJKk7dYZung18JSLGgDHgR8B7M3MmIm4EtkfEKqrDKxetUklSV9oGfWb+DHjRGd57ANjY76IkSf3jmbGSVDiDXpIKZ9BLUuEMekkqnEEvSYUz6CWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpX556xkjpw/MQ0rdb4Wec52/tHj53k8KGn+12WljGDXuqzFeeOce0H7ul6+Xtvv47DfaxHcuhGkgrXUY8+Im4FPgZszMyHIuJqYDuwGtgL3JCZB/pdpCSpe7V79BGxCbgaeLR6PQrsBN6XmRuAbwO3LUaRkqTu1Qr6iFgJbAXes2DylcDRzLy/er0NeHN/y5Mk9aru0M3HgZ2ZuTci5qddTNW7B8jMgxExGhEXZOaTdQuYmFhTu9h+aHc0RFOV2q7lqqnbs6l119HktrUN+oj4feDFwF8uRgFTU0eYmZldjI/+Da3WOJOT5R3PMMztavKPYykN6/Y8m2H+HvZqmNo2OjrScQe5ztDNy4HfAX4eEXuB5wDfAC4HLpmfKSKeBcx00puXJC2+tkGfmbdl5oWZeWlmXgo8DrwG+BSwOiKuqWbdAuxetEolSV3p+jj6zJwBbgT+LiIeYa7nvyjDO5Kk7nV8ZmzVq59//gCwsZ8FSZL6y0sgqK3xtatZtdKvitRU/nrV1qqV53R97ZZ7b7+uz9VI6pTXupGkwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSqcQS9JhTPoJalwBr0kFc4bjywT3iVKWr785S8T3iVKWr4cupGkwtXq0UfE3cBlwAxwBHh/Zu6JiA3ADmACmAI2Z+Yji1WsJKlzdXv0b8/MF2Tmi4BPA3dW07cBWzNzA7AV2L4INUqSelAr6DPzfxe8fAYwExHrgE3Armr6LmBTRLT6W6IkqRe1d8ZGxBeAVwMjwGuBi4D9mTkNkJnTEfFENX2y7udOTKzpqOBetVrjA13foJTaruWqqduzqXXX0eS21Q76zPxTgIi4EfgUcHM/CpiaOsLMzGw/PqqtVmucycnDA1nXINVpV5O/pMtRE7+npf6+YLjaNjo60nEHueOjbjLzy8ArgMeB9RExBlA9Xgjs6/QzJUmLp22PPiLWAOdn5r7q9bXAk8ABYA9wPbCzenwwM2sP20j6TcdPTHf9F9jRYyc5fOjpPlekpqszdHMesDsizgOmmQv5azNzNiK2ADsi4hbgKWDz4pUqLQ8rzh3r6eS24Rhg0DBpG/SZ+Svg6jO89zBwVb+LkiT1j2fGSlLhDHpJKpxBL0mFM+glqXAGvSQVzqCXpMJ54xGpIJ5spdMx6KWCeLKVTsehG0kqnEEvSYUz6CWpcAa9JBXOnbGSgN6P2NHwMuglAb0fsaPh5dCNJBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFa3scfURMAF8GngccBx4B3p2ZkxFxNbAdWA3sBW7IzAOLV64kqVN1evSzwCczMzJzI/BT4LaIGAV2Au/LzA3At4HbFq9USVI32gZ9Zj6Zmf+xYNJ/AZcAVwJHM/P+avo24M19r1CS1JOOLoFQ9eLfA/wzcDHw6Px7mXkwIkYj4oLMfLLuZ05MrOmkhJ51ey2PYVdqu9QsJX8Pm9y2Tq9187fAEeAO4A39KGBq6ggzM7P9+Ki2Wq1xJifLu4dOnXY1+Uuq5ijx9wXDlR2joyMd
d5BrH3UTEZ8GrgDekpkzwGPMDeHMv/8sYKaT3rwkafHVCvqI+ARzY/Kvz8xj1eTvA6sj4prq9RZgd/9LlCT1os7hlc8HPgz8GHggIgB+nplviIgbge0RsYrq8MpFrFWS1IW2QZ+ZPwRGzvDeA8DGfhclSeofbzwyYONrV7NqZXf/7UePneTwoaf7XJGk0hn0A7Zq5Tk93cVnOPb7S2oSr3UjSYUz6CWpcAa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKpxBL0mFM+glqXBeAqFBjp+YPuMNRLyxiJqql+s/gdeAqsOgb5AV5471dJ0caRj1cv0n8BpQdTh0I0mFM+glqXAGvSQVzqCXpMI1dmdst3vqW61x99JLWlYaG/TeqUmS6nHoRpIK17ZHHxGfBv4EuBTYmJkPVdM3ADuACWAK2JyZjyxeqZKG1fET06w4d8wT94ZUnaGbu4G/Ae47Zfo2YGtm7oyIG4DtwCv7XJ+kBvBkvuHWdugmM+/PzH0Lp0XEOmATsKuatAvYFBGt/pcoSepFtztjLwL2Z+Y0QGZOR8QT1fTJTj5oYmJNlyX0xj8xpXIM4vfc5MxY8qNupqaOMDMz2/Fyvf6nT04uzXE3Tf6ySMNqsX/Prdb4kmXGqUZHRzruIHd71M0+YH1EjAFUjxdW0yVJQ6SroM/MA8Ae4Ppq0vXAg5nZ0bCNJGnxtQ36iPhcRDwOPAf4ZkT8sHprC/D+iPgx8P7qtSRpyLQdo8/Mm4CbTjP9YeCqxShKktQ/S74zVpKWSifXzDr1QIomXTPLoJe0bC2Xa2Z5rRtJKpw9ekmNdvzEtOentGHQS2o0r7PTnkM3klQ4g16SCmfQS1LhDHpJKpw7YyWpC70c7TPok60MeknqQq9H+wzyZCuHbiSpcAa9JBXOoJekwhn0klS4Zbkztpe95ceOT7NyxVifK5KkxbMsg77XveXdLju/vCQNkkM3klQ4g16SCmfQS1LhDHpJKlzPO2MjYgOwA5gApoDNmflIr58rSeqPfvTotwFbM3MDsBXY3ofPlCT1SU89+ohYB2wC/rCatAu4IyJamTnZZvExgNHRka7Xv+781Y1bdinX3cRll3LdtrkZyy7luntZttvs62a5kdnZ2a5WBhARVwJfysznL5j2I+CGzPxBm8WvAe7reuWStLxdBuytM+NSnjD138DLgF8A00tYhyQ1yfyp+Y/XXaDXoN8HrI+Iscycjogx4MJqejvHgPt7XL8kqY2edsZm5gFgD3B9Nel64MEa4/OSpAHpaYweICJ+m7nDK88HnmLu8MrsQ22SpD7oOeglScPNM2MlqXAGvSQVzqCXpMIZ9JJUuOLuMBURe4Gj1T+AD2XmN06Z57eAfwCuBE4CH8zMfxlgmR2r2a4vAq8CDlaTdmfmXw+oxK5ExCrgM8zVfRT4Tma+65R5xoDPAa8FZoHbMvMLg661EzXb9THgvcAT1aT/zMz3DbLOTkXEpcDdCyY9E1ibmRecMl+jtlkH7foYDdtmUGDQV96YmQ+d5f0PAocy8/KIuAK4LyIuz8wjA6qvW+3aBXM/qDsGUk1/fJK5INyQmbMR8ezTzPM24HLgCuaukvpgRHwzM/cOrsyO1WkXzF1C5IMDrKsn1f/5C+dfR8RnOX2ONGqbddAuaNg2g+U7dPMWqqtsVpdU/h7wR0ta0TIUEWuAzcDNmTkLkJm/Os2sbwH+PjNnqpPx7gbeNLhKO9NBuxotIlYwF+h3nubtRm2zhdq0q5FK7dHfFREjzF1i4SOZ+etT3r8YeHTB68eAiwZVXA/atQvgzyPi3cBPgQ9n5v8MtMLOPI+5exjcGhGvAI4AH83MUy+N0bTtVbddAG+NiFcDvwRuzczvDLDOXr0O2H+GCxg2bZstdLZ2QQO3WYk9+pdl5guAlwAjQJOGMc6mTrv+Crg8MzcCXwW+Xo2VDqsx4LnMXTbjxcCHgK9GxNqlLatnddu1DbgsM38P+BRwT0RMDLbUnryT
gnq9C5ytXY3cZsUFfWbuqx6PAZ8HXnqa2R4DLlnw+mLqXYhtydRpV2buz8yZ6vmXgDXAcwZZZ4ceY25n+C6AzPwuczuSN5xmviZtr1rtysxfZuaJ6vm/M9em3x1sqd2JiPXAy4G7zjBL07YZ0L5dTd1mRQV9RJwXEc+ono8Ab2Xuomun2g28u5rvCuZ6yV8fVJ2dqtuu6ks6//w1zF3+ef+g6uxUZh4EvkV145rqtpTrgJ+cMutu4M8iYjQiWsDrgX8cZK2dqNuuU7bXC4FLgaZcJ+rtwNcyc+oM7zdqmy1w1nY1dZuVNkb/bOAr1XDFGPAj5g6FIiL2AH+cmU8w9yfXFyPiJ8yF4bsy8/AS1VxH3XbtqI7umAEOAa/LzJNLVHNdW4A7I+J24ARwY2b+OiL+FbglM78HfBm4Cpi/F/HHM/PnS1NubXXa9Ynq5j3TwPFqnl8uXckdeQdw08IJBWwzaN+uRm4zL2omSYUrauhGkvSbDHpJKpxBL0mFM+glqXAGvSQVzqCXpMIZ9JJUOINekgr3/73i31SubcXCAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Compare the distribution of \"rm\" with and without outliers.\n",
    "# `minimum` and `maximum` are assumed to be the outlier bounds defined in an earlier cell.\n",
    "print(\"Distribution with all values\")\n",
    "plt.hist(data[\"rm\"], bins=40)\n",
    "plt.show()\n",
    "print()\n",
    "print(\"Distribution without outliers\")\n",
    "data_without_outliers = data[(data[\"rm\"] > minimum) & (data[\"rm\"] < maximum)]\n",
    "plt.hist(data_without_outliers[\"rm\"], bins=20)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dSdqmARsDwbM"
   },
   "source": [
    "#### Deal with missing data [DEMO - WALKTHRU]\n",
    "\n",
    "Make an informed decision about which rows to eliminate, based on which column or columns have missing data.\n",
    "\n",
    "There are several options:\n",
    "- remove rows with one or more missing values\n",
    "- remove rows where the number of missing values exceeds a certain threshold (e.g. more than 75% of the columns have missing values)\n",
    "- fill the missing values with computed values\n",
    "    - mean\n",
    "    - imputed\n",
    "    - predicted\n",
    "    - other transformations\n",
    "    \n",
    "These decisions and transformations depend entirely on the goal and objective of the analysis."
   ]
  },
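  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the options listed above, using pandas on a small made-up frame. The frame `toy`, its column names and values, and the threshold variable `max_missing_ratio` are assumptions for illustration only; they are not part of the dataset used in this notebook.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame; column names and values are illustrative only.\n",
    "toy = pd.DataFrame({\n",
    "    'rm':   [6.5, np.nan, 7.1, np.nan],\n",
    "    'age':  [65.0, 78.0, np.nan, np.nan],\n",
    "    'tax':  [296.0, np.nan, 242.0, np.nan],\n",
    "    'medv': [24.0, 21.6, np.nan, np.nan],\n",
    "})\n",
    "\n",
    "# Option 1: drop every row that has at least one missing value.\n",
    "dropped_any = toy.dropna()\n",
    "\n",
    "# Option 2: drop only rows where more than 75% of the columns are missing.\n",
    "max_missing_ratio = 0.75\n",
    "row_missing_ratio = toy.isna().mean(axis=1)\n",
    "dropped_sparse = toy[row_missing_ratio <= max_missing_ratio]\n",
    "\n",
    "# Option 3: fill each missing value with the mean of its column.\n",
    "filled_mean = toy.fillna(toy.mean())\n",
    "\n",
    "print(len(toy), len(dropped_any), len(dropped_sparse))\n",
    "print(filled_mean)\n",
    "```\n",
    "\n",
    "Mean filling is shown only because it is the simplest of the computed-value strategies; which option is appropriate still depends on the goal of the analysis."
   ]
  },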
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 2184
    },
    "colab_type": "code",
    "id": "C4HRvWodDxjh",
    "outputId": "6c66fc2e-d8b5-4a62-888c-a6f41ac5bd60"
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZEAAAELCAYAAAAY3LtyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3Xu8pXPd//HXzBQ51aApOYRO75AcJ4cSnVP5KSGnknRC55NuuQu3pKLuhFs0Cjkk0m1KRuWWDpQwJPWWMg4hhxljBjMxM78/vt/NstvHtfdea++13s/HYz/2Wte6Dp/vutZan+v6Xtf1uSYtW7aMiIiIZkxudwARETFxJYlERETTkkQiIqJpSSIREdG0JJGIiGhakkhERDQtSaQLSZoj6bXtjqPdJH1X0pFNTruCpJmS5kv6wWjHNlFI2lvSJeMgjudKWihpygjmsVDS80YzrpGaCN/Vp7Q7gFaSNAd4NvAYsAS4ETgdONn20iFMvx5wC/BU24+NYZwtWU4nk7QD8D3ba4/RInalfJZW7+Z1ZPtM4MxxEMdtwMojnMeIpu9W3bgnspPtVYB1gaOBg4EZ7Q0pJqB1gZuaSSCSRrTxNtLpI0ZT134Ybc8HLpR0N3ClpGNt3yDpzcCRwPOB+cAM24fVyS6v/x+QBPA64B7gFGATYBkwCzjI9gMAkg4GPgI8HbgTOND2LyRNBj4DvA+YCvwC+KDtuX0tx/YVPbFLWhP4G7BWHR9JmwE/A54DPHegmBpJ+i5wh+1D6/MdaNiCr8v6JvBKYCHwddvH1ddeBpwIvAh4BDjT9if6er8l7QwcDjwPuLfGc3Gd/0nAK4C5wJdtnzLE2OYAxwPvovyoXwzsC0wBfgosL2lhDeFFtu/sI7RnSvoZsDVwDfAu27fW+b+4tn2LGvN/2j5X0uHAfwCTJL0V+CjwHeAQyvpcocbyYdvzG/Ys3wt8AZgDvFLS1sDXgA2BW4GP2r6sn/dvDvA/wN7lqVYCnkX/6+YwYCNgMbBzXebb69/H6/D9bV9Sx+9zPQzhs7Y38F7br6ivLQMOAD4JTKPspXzI9rLa1fSVuo4WAMfW+Pvc465tPgF4J+X7eE59j79b4/wdsJvteb333iW9G/h8jeE+4FDbZ0p6AWWjcVPgUeAXtt/REPsLbd9cP3sPAevV9/dGYC/bf6vjvr7GvkZt40bAGba/3asNE+67OlzduCfyJLZ/D9wBbFcHPUT5UZoKvBk4oP5QQFk5AFNtr1x/2CcBXwLWBDYA1gEOg/JNBz4ETK97P2+gfJkBPgy8Fdi+TjuP8oXpbzmNMd8JXEH5QeixF3Ce7UcHimk4aqKbCVwHrAW8BviYpDfUUb4BfMP20ylf8nP7mc/LKN2Gn6a8r6/kiffhHMr7vyali+goSa8eRpi7A28E1gdeCrzb9kPAjsCd9f1buZ8EAuVH8L+AZwKzqV0z9Uf6Z8BZlB/rPYATJW1o+wvAUcD367xnAO+uf6+iJMqVKQmu0faU9fEGSWsBP6FssKwGfAo4X9K0Adq6J+UzORVYysDrBmAn4AxgVeBayg/U5Dr+EcC3Gsbtcz0M4bPWl7cA0ynrY3fK5x5Kgt2R8gO+OeXzP5i3UzbWXlTb81NKIplW2/KR3hPUdXccsGP93m1LWbdQ1vUllPdkbcqPbn/2oGz4rArcDHyxzv+ZwHmUDYnVAddl/JuJ9l1tRtfuifRyJ+WLTK8tweslnU358v+orwlt30z5gAHcK+lrlK1NKMddlgc2lHSv7TkNk36QsoV2Bzy+5XibpHcOMeazKB/GUyRNonzg9x5CTMMxHZhm+4j6/O+STqnLmkXZknuBpGfavg+4sp/57A+cavtn9fk/ACStA7wceLPtRcBsSd+mJPFLhxjjcT0JQtJMyg/UcPzE9uV1+s8B82tc2wJzbH+njnetpPOB3Sg/LL3tDXzN9t/rvP4DuEHS
fg3jHFYTHJL2AS6yfVF97WeS/gC8CThtgLbeXqffioHXDcCvbM+q4/8A2AU42vYSSecAJ0uaCqzCwOuh389aP46uW9IPSPo/yjq5mJJQvtHwmT+a8mM3kG/a/mcd/1fAPbavrc8vGGD6pcBLJN1m+y7grjr8Ucpe65o1jl8PsOwL6kYmks6k7DVCWUd/sv3D+tpxlI2A/kyk7+qwJYkUa1F24Xu+nEcDLwGWoySBfs++kfRsSpbfjvJlnEzZq6DuFn+MsmWxkaRZwCfqj966wAWSGg/oL6EcrB2K84FvSnoOZSttKfCrwWIapnWBNSU17lpP6VkOJTkcAfxF0i3A4bZ/3Md81gEu6mP4msBc2wsaht0KbDmMGO9uePxwnedw3N7zwPZCSXPrPNYFturV9qdQtuz7siYl9h631vEb1+ftDY/XBXaTtFPDsKcC/zeUWBl83QD8s+HxI8B9tpc0PIeyxzTYeuj3s9aP3uuk54D1mr3a0Pi4P73b0Pv5vx0Mt/2QpHdQfthnSPoN8Enbf6F0If8X8HtJ84BjbZ86knbUrro7BmjDRPquDlvXJxFJ0ylJpGeL5CxKN8SOthdJ+m9KVweUPsvejqrDN7Y9t3Z9Pd6NYfss4CxJT6d0H3yZ0sd7O/Ae27/pI6Z1B4u79gNfAryDsht8ju2e+AaMqZeHgBUbnq/R8Ph24BbbL+wnhr8Ce9Zd6V2A8ySt3rO13Ws+z+9jFncCq0lapeEH7LnUPZVBYhvMUMtTr9PzQNLKlD3SO2vMv7T9uiHOp2fDoMdzKWcB/pPSbdI7ptspfejvG+L8+5q+33UzTAOuh0E+a8NxF0+8F9Dw3o+2ugc2S9IKlC7DU4DtbN9N6VZD0iuAn0u6vO4RDNWT2lH3Lvo9C3CCfVeHrWuTSP1RfyVlK+B7tv9YX1qFslW2qPbl70XpQ4VycHUppc/7pobx51O6Qdai9Pv3LEOUBPUbYBFly6nnPPaTgC9K2tf2rbUvfFvb/9vPcvpyFuXssnWBxuMI/cbUh9nAJ1Wul1gO+FjDa78HFqicHHAc8C/Kl2AF21fVLplZtu9t2ALq61TpGcAlkn5M2dJ+DrCK7b9I+i3wJUmfomyl7c8TXSUDxTaYfwKrS3qGy0kU/XlT/TH5PWUL9Urbt9dYj67di+fUcTcFFtr+cx/zORs4WNJPKeuv55jJY+Vj8G++B1xV+6x/TtkL2Rq4uae7ZxADrpshTP+42t6B1gP0/1kbjnOBj0r6CeUH8eAm5zOgunW/NeV9fYRykHlpfW034Ir6Hs+j/IAPenp/Lz8Bjq8/+D+mdE0PtoEzUb6rw9aNB9ZnSlpAydyfo/RzNvZbHwgcUcf5PA0HoGw/TDm49htJD6icXXM45SDhfMqH64cN81qe0jV2H2XX+FmUg3FQkteFlB/XBZQ+yq0GWE5fLgReCNxt+7qG4QPF1NsZlINxcyjJ8vsN7V1COUi6KeXMl/uAbwPPqKO8EfiTyhlQ3wD2sP0IvdR+5f2Ar9eYfskTW+17Us6AuRO4APiC7Z8PFttgatfF2ZS+4QdUzlzpy1mUPui5lLOw9qnTLwBeT+lTvpOy/r5MWad9ObXGeznlvVpEOXmiv/hup5w1dQgl6dxO+QEZ0ndyCOtmuAZaD9D/Z204TqGsx+spB/ov4olrtkbTZOATlLbMpRzTPKC+Nh34Xf3MXkg5I+7vw5l5PaawG+VMs/spZ9f9gXLGW38mxHe1GZNyU6qIaAdJOwIn2R60+3Y8q11EdwB72x7omFZH6trurIhorXp84lWUrehnU/YAL2hrUE2q3ZC/o3SXfZpyqu6onfE0kXRjd1ZEtMckSvfNPEp31p8pXcYT0TaUiwjvo1y/8tbR6h6aaNKdFRERTcueSERENK1Tj4ksTzkL4y5G/8yPiIhONYVyCv5VDHy22eM6NYlMZ+AraiMion/bMXBJmMd1ahK5C2DevIdYurT7jvmsvvrK3H//wsFH7FBpf9qf
9jfX/smTJ7HqqivBE7XGBtWpSWQJwNKly7oyiQBd2+4eaX/a381Gof1DPgyQA+sREdG0JJGIiGhakkhERDQtSSQiIpqWJBIREU1LEomIiKYliURERNOSRCIiomlJIhER0bRxlUQkvVvSee2OIyIihmZcJZGIiJhYmq6dJWkZcCjwVmB14H3Aayk3hH8qsJvtP9dx9wUOrMubDxxg25KWA74JvJpyh7BrG+b/V2DXnpvaS/oQsIXt/ZqNOSIiRtdICzA+YHu6pN2A/wX2sP0fkj4DfA7YR9J2wO7AK20vlrQjcCrwcuADwPrAhpTEczkwp877NGBf4BP1+X7Ax4cT3OqrrzyStk1o06at0u4Q2irtT/u7WSvbP9Ik8v36/xpgme0f1+dXA7vUxzsBmwC/kwTlPsur1tdeBZxm+1HgUUnfA15RXzu9TvMZYANgKsO8R8j99y/symqe06atwr33Lmh3GG2T9qf9aX9z7Z88edKwN75HmkQW1f9LePJdsJY0zHsScKrtzw9nxrZvk/QnYEdgB+C7trsvI0REjGOtOLA+E3iXpLUBJE2RtEV97VLgnZKeImkFYK9e034XeC+wJ6V7KyIixpExTyK2L6ccH7lQ0nXADcDO9eWTgduAP1MSylW9Jv8hZS/kRtu3jXWsERExPJOWLevIHqL1gFtyTKQ7pf1pf9o/4mMi6/PESU4DT9PUkiIiIkgSiYiIEUgSiYiIpiWJRERE05JEIiKiaWOSRCQtk9TvZY+SptYr0Yc6v8Nqna2IiBhH2rUnMhUYchIBvgAkiUREjDMjLXsyIEmTgeMpVXoXAwttvxw4AZgqaTbwsO1tJX0S2KPGtIhS6Xe2pBPq7H4raSmwg+0HxjLuiIgYmjFNIpTCi68CNrS9VFJP4cWDgD/Y3rRh3NNtHwsg6bXAScDWtg+SdCCwre2FYxxvREQMw1gnkb9TSrzPkHQp8OMBxt1C0iHAasBS4EUjXXhKwXevtD/t72YTqRT8gGzPl7QRpf7Va4EvS9q893j1oPl5lHuOXCNpTeAfI11+yp50p7Q/7U/7W1cKfkwPrEuaBqxoexbwWcpdDZ8HPAisKKkniT2NktBur88P7DWrBcAzxjLWiIgYvrE+O2sd4Oe1eu/1wE+BK23PBc4E/ijpt7YfBD4PXCXpauChXvM5FrhU0mxJU8c45oiIGKJU8e1A2Z1P+9P+tL8ZqeIbEREtlSQSERFNSxKJiIimJYlERETTkkQiIqJp4zaJDFYJOCIi2m/cJpGIiBj/xnsS+XS9wNCS3t7uYCIi4snGexJZUiv9/j/gZEnPandAERHxhLGu4jtSMwBsW9I1wNbAhUOdOFV8u1fan/Z3s46p4ttuKXvSndL+tD/t75AqvqNgPwBJLwQ2A65sbzgREdFovO+JPEXStcCKwAds39PugCIi4gnjNonYnlQfHtbOOCIion/jvTsrIiLGsSSRiIhoWpJIREQ0LUkkIiKaliQSERFNSxKJiIimjSiJSDpM0nJNTruepPf3GnaRpOePJKaIiGidke6JfAHoM4lIGuwalPWAJyUR22+y/bcRxhQRES0y6MWGkpYBRwA7AysAh9g+X9IJdZTfSloK7AD8N/AYIGAVYFNJZ9bnywM3A++xPQ84AVhf0mzgZtu7SpoDvMX2DZJeAHwLmFbneYjti0en2RERMRqGesX6EtubShIlafzK9kGSDgS2tb0QoLzMpsD2th+q037U9n319SOBg4HPAgcBx9jesp9lngmcbHuGpA2ByyVtYPveoTYuVXy7V9qf9nez8VjFdzgl2c9rSCAA75K0N6XbayXgpsEWJmkVSjL6Tl3ujXWPZWtg5hBjThXfLpX2p/1p/8Su4ruw54Gk7YADgDfa3hg4FHjaGCwzIiLaYKhJpL+S7AuAZwww3VRgPnC/pOWB9zS89mB/09peAMwG9q3L3QDYhJSCj4gYV4bandVfSfZjgUslPUI5sN7bxcA+lC6s+4DLgZfV
164HLOkG4C+2d+017d7AtyR9nHJg/Z3DOR4SERFjb9KyZQMfM6hnZ63Sc/B8glgPuCXHRLpT2p/2p/0jPiayPjBnSNM0taSIiAiG0J3VcHOoiIiIJ8meSERENG3c3h53NORiw4lj0eLHWPDgI+0OIyKGqaOTyP5HXsI98/LDNBHMPHZnuvdQaMTE1fLurJFU/o2IiPGlHcdE+q38GxERE0tLu7P6qPz7FeCjPJFUPmX7F5KeBfwe2NX2HyTtC7wP2MH2Y62MOSIi+tfSPRHbB9WH29reFJgFbG17M2AP4LQ63j3Au4GzJG1NKUW/ZxJIRMT40u4D688Hzpa0FvAosIakNWzfbfsySWcBvwbeZvv2tkYaY240zyibaGenjba0P+1vlXYnkbOBT9r+kaTJwMM8ucrvZsC9wNrtCC5aa7RKVaTsRdqf9k/sUvCDaaz8OxW4pT5+D+XuhwDUwotPBTYHDpa0aSuDjIiIwbUjifRU/p0NfAz4Ub3R1fOA+wEkvQz4CLCv7bsoB9XPqTerioiIcaLl3Vm2DwcObxh0RsPjQ+r/OZQqkj3T/Ax48ZgHFxERw9LuYyJjasahr293CDFEixbnxLuIiaijk0juJxIRMbZSxTciIprW0XsiqeLbvdL+0Wl/qivHYDo6iaSKb8TIpLpyDGbCdGdJ+q6kD7U7joiIeELbkoikjt4LiojoBq2u4ruMco3Im4GLJZ0LnAisRCl3crLt/67jrgWcDjyHct3I0lbGGhERg2vHnsgjtqfb/k9Kcnit7c2BlwHvl7RBHe844HLbGwIfArZvQ6wRETGAdnQpndbweEXgfyRtQtnTWBPYBPgz8CpK6RNs/13SL1odaERMzDPdJmLMo6nTq/gubHh8FHA38G7bj0m6hCdX8Y2INptoF652+8W23VDFt9FU4PaaQF4CbNfw2qXAfgCS1gde04b4IiJiAO0+Q+pI4AxJ+wM3AZc3vPZR4HRJe1HKxV/W+vAiImIgLU0itif1en4t8JJ+xv0H2fuIiBjX2t2dFRERE1i7u7PGVErBR4xMSvTHYDo6iaQUfHdK+7u7/dFa6c6KiIimJYlERETTkkQiIqJp4+aYSC3OuArwa2Ab27kRSETEODdukkgP25u2O4aIiBiatiURSbtQamctAs5vGN6zR/IwcDzwamAxsND2y9sQakRE9KMtSUTSs4FTgG1tW9Jn+hhtE0ol3w1tL5W06nCXk3usd6+0P+3vZp1exRdgK+Aa267PTwa+3GucvwNPBWZIuhT48XAXkutEulPan/an/d1TxbdftucDGwHnAC8F/iRpjfZGFRERjdqVRK4ENpP0wvr8vb1HkDQNWNH2LOCzwHzgea0LMSIiBtOWJGL7HuD9wExJ19L3jajWAX4u6TrgeuCnlOQTERHjRNvOzrL9Q+CHDYOOrP97ysVfA2zR0qAiImJYxu0xkYiIGP+SRCIiomlJIhER0bRxV/ZkNOViw4lv0eLHWPBgyqhFjFcdnUT2P/IS7pmXH6CJbOaxO9O9l41FjH/pzoqIiKYliURERNNa0p0l6UxAwPLAzcB7bM+T9EXgHcD9wGXAa2xvWafZFziwxjgfOKCh1lZERIwDrdoT+ajtLW1vDPwJOFjSTsBbKNV6twF6SqAgaTtgd+CVtrcAvgqc2qJYIyJiiFp1YP1dkvYGlgNWAm6qj8+1/RCApNOA/6zj70RJLr+TBOUq9mGXgo/O0MyZZp1ydlqz0v60v1XGPInUvYoDKPcOuVfSXpS6WQOZBJxq+/NjHV+Mf8Mta51S4Gl/2t9ZpeCnUo5p3C9peeA9dfhlwK6SVpQ0GXhnwzQzKXsvawNImiIpdbQiIsaZViSRi4G/UbqwfkkprIjtC4FZlAq9VwJ3UpINti8HPgdcWKv43gDs3IJYIyJiGMa8O8v2o5QzsPryRdufrXsi3wauaJjuTODMsY4vIiKa1+4r1k+XtB6wAnA18JXRnPmMQ18/mrOLNli0+LF2
hxARA2hrErH9trGcf+6xHhExtnLFekRENC1JJCIimpYkEhERTUsSiYiIpo2LJCKp3WeJRUREE9r24y1pGXA48GbgYkl/A/YCHgBeCvwD+DBwDPAC4CpgH9vdd7pVRMQ41e49kUdsT7fdU3hxOvAJ2y8GHgHOoiSWDYGNgde0J8yIiOhLu7uRTuv1/De276iPrwXm2H4AoJY/eQHw86HOPPdY715pf9rfzTqqiu8gFvZ6vqjh8ZI+ng8r3lxs2J3S/rQ/7e+sKr4REdGhkkQiIqJpk5Yt68junvWAW9Kd1Z3S/rQ/7R9xd9b6wJwhTdPUkiIiIkgSiYiIEUgSiYiIpiWJRERE05JEIiKiaUkiERHRtCSRiIhoWpJIREQ0rSW1s2rZ988BbwNWBz5t+/z62huBLwFTgHuBD9i+WdI+lFLwr6DUzboEOM/2Sa2IOSIiBteSK9ZrEvmw7eMlvRw41/Zakp4F/AnY3vaNkvYH3m97qzrdDMr9ReYDL7G9+xAXuR5wy6g3JCKiOwz5ivVWVvE9p/6/ElhT0tOArYDrbN9YX/sOcKKkVWwvAD4EXA08FdhiuAtM2ZPulPan/Wl/Z1bxXQRge0l9PpQEtgawMrAc8PQxiisiIprU7gPrVwKbSHpxfb4vcK3tBZKWA74PfAY4DDgn92KPiBhf2ppEbN8LvBM4S9L1wD71D+ArwGzb59j+DuUYx5HtiTQiIvqSUvAdKH3CaX/an/Y3I6XgIyKipZJEIiKiaUkiERHRtI4+22m45zt3kmnTVml3CG2V9qf9nWzR4sdY8OAj7Q4D6PAksv+Rl3DPvPHxRkdEjJaZx+7MeDl1oCXdWZLmSHpJK5YVERGtk2MiERHRtFHvzpK0DfBVoKdT8tP1/+6STgGeAxxj+/g6/jHA9pTSJvcB77F9ay3OeBbw7Dr9z21/fLTjjYiI5o3qnoik1YALgM/Y3gTYHLiqvryi7W2AHYCjJfUc9T7a9vQ6/tnAl+vwvYG/2d7Y9sbAEaMZa0REjNxo74lsA9xo+7fweLHFeZKgVvG1PUfSPGBt4C/AjpIOohRabIznSuDjkr4K/BKYNcqxRkRMWAOdgdbKs9NaeXbWoobHS4CnSFoX+Dow3fYtkraldGFh+wpJmwGvo9TX+izlBlUREV2vv9ImE70U/BXAhvW4CJKmSFp1gPGfDvwLuFvSZOCDPS9IWh940PY5wCeALeo4ERExTozqj7LtucAuwNdqVd6rGeBmUrb/CPwAuBH4HU++G+EOwDWSZgM/BT5oe+loxhsRESPT0VV8c7FhRHSimcfuPNbdWePy9rgtN+PQ17c7hIiIUbdo8WPtDuFxHZ1Ecj+R7pT2p/3d3P5Wy4HqiIhoWpJIREQ0LUkkIiKaNqIkImm2pBWamC5VfSMiOsCIDqzb3nS0AomIiIlnRElE0jJgFdsLJc0BTqeUKeldqXc74MQ62S+BSX3No/E5sBQ4DdgIeBSw7d1HEm9ERIyu0T4m8m+VeiUtTym++OFajfdy4LlDmNcbgKfb3rBW+P3AKMcaEREjNNrXifRVqXc54GHbl9XXzpV08hDmdR2wgaQTgMuAnww3mNxjvXul/Wl/N5vIVXz/rVJvP+Mt6zXeZABJT+sZaPvvkjYCXgPsCBwlaWPbixiiXGzYndL+tD/tn7hVfPtiYIV6XARJuwJTG16/GZheH+/VM1DS2sAS2z8CPg5MA1ZrQbwRETFEY55EbC8G9gROrJV9dwBuaxjlE8C3JF1NSRQ9NgaukHQd8HvgS7bvHOt4IyJi6Dq6im+6s7pT2p/2p/2tq+KbK9YjIqJpSSIREdG0JJGIiGhakkhERDSto29KlYsNu1fan/aPpUWLH2PBg7n1NnR4Esk91iNiLMw8dme69/yvJ0t3VkRENC1JJCIimtb27ixJbwKOahi0IbAbcDjwO2AbSq2tPWz/ufURRkREf9qeRGxfBFwEIOl9wH7AYsp9RPaz/QFJnwMOBfZuW6AR
EQ3G88kLE7mKb9MkvYFSR2s74CWUm1BdW1++EtipXbFFRPQ2XkurdGIV30FJ2gQ4CdjZ9n118FDLykdERJu0PYlIWgs4H9jH9k3tjiciIoZuPGzdv5dSAv4EST3DPt6+cCIiYqjankRsH045E6u3LRvGuazxeUREjA9tTyJjacahr293CBHRgRYtfqzdIYwbHZ1EclOq7pT2p/3d3P5Wa/uB9YiImLg6+va47Q4iIqLV/vXoEuY/8HBT0zZze9yO7s5KFd+I6DYzj925pctrW3eWpC0lndmu5UdExMi1ZU9E0lNs/4HUwoqImNDGJIlI2gb4KtBTBezTwMnAOcCrgT9KOgM4xvaWktYD/gCcArwRWIGSYD4IbAU8QimJcvdYxBsREc0Z9e4sSasBFwCfsb0JsDlwVX356bZfZnv/PiZdHfi17c2AGcAvgBNsvxS4GvjQaMcaEREjMxZ7ItsAN9r+LYDtJcC8WtLk9AGmW2j7J/XxNcAdtmfX51cDrxuDWCMiOk4nl4JfOMBrixseLyFVfCMimjLRS8FfAWxYj4sgaYqkVcdgORER0WajnkRszwV2Ab4m6XpKV9QWo72ciIhov46+Yj0XG0ZEt5l57M6j0Z015CvWOzqJtDuIiIhWS9mTUZQqvt0p7U/7u739rZQqvhER0bQkkYiIaFqSSERENC1JJCIimpYkEhERTUsSiYiIpiWJRERE0zr1OpEpUC6c6Vbd3HZI+9P+tH+E000Z6jSdesX6K4BftTuIiIgJajvg10MZsVOTyPLAdOAuShn5iIgY3BTgOZQbCS4eZFygc5NIRES0QA6sR0RE05JEIiKiaUkiERHRtCSRiIhoWpJIREQ0LUkkIiKaliQSERFN67iyJ5JeBJwGrA7cD7zL9l/bG9XYkXQM8HbKfeU3tn1DHd7x74Ok1YEzgOcD/wL+CnzA9r2Stga+BaxAuVf0PrbvaVesY0XSjyj3w14KLAQ+bHt2N6z/HpK+ABxG/fx30bqfAyyqfwAH257V6vZ34p7IScAJtl8EnEB5MzvZj4BXArf2Gt4N78My4Cu2ZXtj4G/A0ZImA98DDqrtvxw4uo1xjqV9bW9iezPgGODUOrwb1j+SNge2pn7+u2zdA+xqe9P6N6vZWPXyAAACOUlEQVQd7e+oJCLpWcDmwNl10NnA5pKmtS+qsWX717ZvbxzWLe+D7bm2L2sYdCWwLrAFsMh2T+2fk4DdWxxeS9ie3/D0GcDSbln/kpanJMgDGgZ3zbrvR8vb31FJBFgH+IftJQD1/511eDfpuvehboEdAFwIPJeGPTPb9wGTJa3WpvDGlKRvS7oN+CKwL92z/o8Avmd7TsOwrlr3wJmSrpd0oqSptKH9nZZEont9k3JM4Ph2B9Jqtt9r+7nAIcBX2x1PK0jaBtgSOLHdsbTRdrY3oRSbnUSbPvudlkRuB9aSNAWg/l+zDu8mXfU+1JMLXgi8w/ZS4DZKt1bP688Eltqe26YQW8L2GcCrgDvo/PW/PbABcEs9wLw2MAt4AV2y7nu6sW0vpiTTl9OGz35HJZF6BsJsYM86aE/gWtv3ti+q1uum90HSUZR+4LfWLxPA1cAKkl5Rn38Q+EE74htLklaWtE7D852AuUDHr3/bR9te0/Z6ttejJM43UPbEumHdryTpGfXxJGAPyjpv+We/40rBS3ox5dTGVYF5lFMb3d6oxo6k44BdgDWA+4D7bW/UDe+DpI2AG4CbgEfq4Ftsv03StpQzkp7GE6c5/rMtgY4RSc8G/hdYiXLfnLnAp2xf0w3rv1HdG3lLPcW3G9b984DzKff/mALcCHzE9l2tbn/HJZGIiGidjurOioiI1koSiYiIpiWJRERE05JEIiKiaUkiERHRtCSRiIhoWpJIREQ0LUkkIiKa9v8BQ74TY7Bp6IgAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset rows count BEFORE dropping rows with missing values: 556\n",
      "\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZEAAAELCAYAAAAY3LtyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3XncrXO9//HX3juJbG1pV6ZshffJkHEXojT9Sh0/1UHGRJGpSahTncJRadAgHEMUIol0kJAklSHTpqi3DDuE0zZmm4723ueP63tnubuHdV9r3Wu47/fz8diPvda6ps/3utdan+v7va7rs6YsWrSIiIiIOqZ2O4CIiOhfSSIREVFbkkhERNSWJBIREbUliURERG1JIhERUVuSSAAgaa6kN3c7jl4n6V2S7pI0X9J63Y6nkyTtKOmiHojjZWX/T2thHfMlvbydcbWqXz+Dz+l2AN0maS7wEuDvwALgZuBk4DjbC5tYfhZwB7CY7b+PY5wd2c5EJmlz4Hu2V2xhNV8F9rX932Wdc4EP2L645QB7nO1TgVN7II47gaVaXEdLy8cz0hOpbGl7OrAycBjwCeCE7oYUPWpl4KZ2rEjSFEkjfgYlTfoDvehteYM2sP0IcI6k+4ArJR1u+/eS3gEcCrwCeAQ4wfZBZbHLyv8PSwJ4C/BX4HhgHWARcCGwj+2HASR9AvgwsDRwD7C37Z+XL5QDgd2BGcDPgT1tPzjUdmxfMRC7pOWB24AVyvyU4ZafAcsBLxsppkaSvgvcbfsz5fnmNBzBl219C3gdMB/4uu0jyrRXA0cDqwNPAKfa3m+o/S1pK+Bg4OXAvBLPBWX9xwCbAg8CX7J9fJOxzQWOBN5L9YV/AbALMA34KbC4pPklhNVt3zMopiH/1pIWBx4o67mhvEcuL/v1XEkLgENsf1nSRsDXgDWAPwMfsX1pWf+lwG+AzYH1gbWBWwfFMBf4L2DH6qmeD7x4hH1+ELAm8BSwFTAX+Lfy72Pl9ffbvqjMP+T+beI9tCNVr2vTMm0RsBfwcWAmVS9lX9uLylDTl8u+fxQ4vMQ/ZE+6tPkoYOey708HPgV8t8R5FbCN7YcG98olvQ/4bInhfuAztk+VtCrVweC6wNPAz22/pyH21WzfWt5TjwGzyv69GdjB9m1l3v9XYn9paeOawCm2vz2oDX33GWyH9ESGYPu3wN3AZuWlx6i+lGYA7wD2kvTOMu115f8ZtpcqX+xTgC8CywOvBFYCDoLqGwHYF5hdej9vpfrQA3wIeCfw+rLsQ1QfrOG20xjzPcAVVF8cA3YAzrT99EgxjUVJdOcCNwArAG8CPirprWWWbwLftL001ZfBGcOs59VUw4YHUO3X1/HMfjidav8vD2wNfEHSG8cQ5rbA24BVgFcB77P9GLAFcE/Zf0sNTiDFkH9r2081DIGsY/sVtncG7qTqyS5VEsgKwE+oEtELgf2BsyTNbNjGzsAewHSqJDOU7cv2ZwALGXmfA2wJnAIsA1xP9QU1tcx/CHBsw7xD7t8m3kND+VdgNtV+3pbq/QzVgdAWVF/g61O9r0fzb1QHYauX9vyUKpHMLG358OAFSoI9AtiifJ42AeaUyf8JXES1T1ak+tIdznZUBzTLUCX1z5f1vwg4E/h3YFnAZRv/pN8+g+2Snsjw7qH6EmDgKLK4UdL3qb7ofzzUgrZv5Zmjy3mSvgZ8rjxfACwOrCFpnu25DYvuSXUkdzf84wjzTkk7NxnzaVRv2uMlTaH6YOzYRExjMRuYafuQ8vx2SceXbV1IdcS3qqQX2b4fuHKY9bwfONH2z8rzvwBIWgl4LfAO208CcyR9m+qL/ZImYzxiIEFIOpfqi6wpY/1bD2En4Hzb55fnP5N0DfB24KTy2ndtjzYkdoTtuwAkvYaR9znAr2xfWOb/IfBu4DDbCySdDhwnaQZV4hpp/w77HhrGYeVI+mFJv6Da1xdQ
JZRvNryXD6P6shvJt2z/T5n/V8BfbV9fnp89wvILgbUk3Wn7XuDe8vrTVL3R5Uscvx5h22eXg0cknUrVk4Tq73aT7R+VaUdQHRgMp58+g22RJDK8Fai6+gMf4sOAtYDnUiWBHw63oKSXUB0NbEb1oZ1K1augdJ8/SnUEsqakC4H9ypfeysDZkhpP6C+gOvHfjLOAb0lajupobiHwq9FiGqOVgeUlNXbBpw1shyo5HAL8UdIdwMG2zxtiPSsB5w/x+vLAg7YfbXjtz8CGY4jxvobHj5d1NmWsf+shrAxsI2nLhtcWA37R8PyuJtbTOM9o+xzgfxoePwHcb3tBw3OoTkaPtn+HfQ8NY/C+HuitLT+oDc20eXAbBj//p5Phth+T9B6qL/YTJP0G+LjtP1INDf8n8FtJDwGH2z6xlXaUobq7R2hDP30G2yJJZAiSZlMlkYEjl9Ooxtm3sP2kpG8ALyrThiqD/IXy+tq2HyxDX0cOTLR9GnCapKWphhm+RDXEcRewm+3fDBHTyqPFXcaLLwLeQ9VdPt32QHwjxjTIY8CSDc9f2vD4LuAO26sNE8OfgO1Ll/vdwJmSli3DSY3uoupqD3YP8EJJ0xu+6F5G6amMEttomilZPdLfupl13kU1Xr57i3E0zjPiPh+jEffvKO+hsbiXaghpwEotxDyi0gO7UNISVMOIxwOb2b6PalgNSZsCF0u6rPQImvWsdpTexbBX9/XZZ7AtkkQalC/111EdLXzP9u/KpOlUR29PlrH8HajGWqE6IbyQ6uTwLQ3zPwI8UsbID2jYhqgS1G+AJ6mOsAaudz8G+LykXWz/uYyjb+LqctKhtjOU06iuLlsZaDyPMGxMQ5gDfFzSoVRH4x9tmPZb4FFVFwccAfwv1YdlCdtXS9oJuND2vIYjpaEulT4BuEjSeVRH6csB023/UdLlwBcl7U91NPd+nhlSGSm20fwPsKykF7i6iGIoI/2th1tn4/0G3wOuLuPTF1P1QjYCbh0Y2qlhxH0+lhXZvmuU/QvDv4fG4gzgI5J+QvWF+Ima6xlRObrfiGpfP0F1knlhmbYNcEXZ7w9RfYGPetn+ID8Bjixf+OdRDTmPduDSL5/BtsiJ9cq5kh6lyvCfphoP3bVh+t7AIWWez9Jwosr241Qn4X4j6WFVV+YcTHUy8RGqN+GPGta1ONVwyf1UXegXU520gyp5nUP15foo1Vjma0bYzlDOAVYD7rN9Q8PrI8U02ClUJ+3mUn2B/qChvQuoTqauS3WFzP3At4EXlFneBtyk6gqobwLb2X6CQcr4867A10tMv6T60EF1UnkW1VHz2cDn/Mx9GMPGNpoyxPF9qjHkh1Vd4TLYsH/rYXwR+ExZ3/7lPMZWVCeE51G9pw6ghc9aE/t8rEbavzD8e2gsjqf6+9xIdaL/fJ65F6udpgL7UbXlQarzV3uVabOBq8p78Ryqq+RuH8vKyzmFbaiuNHuA6oq7a6iueBtOX3wG22VKfpQqIsabpC2AY2yPOizby8oQ0d3AjrZ/Mdr8k0GGsyKi7cr5iTdQHUW/hOoqpLO7GlRNZWjyKqrhsgOoLtUd1yue+kmGsyJiPEyhGr55iGo46w9Uw4P9aGOqmwjvp7p/5Z3jOTzUbzKcFRERtaUnEhERtU3UcyKLU12ZcS/tvxokImKimkZ1uf3VjHwF2j9M1CQym5Hvso2IiOFtxshlYv5hoiaRewEeeugxFi7sn3M+yy67FA88MH/0GSeQtHlySJv7w9SpU1hmmefDM/XHRjVRk8gCgIULF/VVEgH6Lt52SJsnh7S5rzR9GiAn1iMiorYkkYiIqC1JJCIiaksSiYiI2pJEIiKitiSRiIioLUkkIiJqSxKJiIjakkQiIqK2nkoikt4n6cxuxxEREc3pqSQSERH9pXbtLEmLgM8A7wSWBXYH3kz1I/GLAdvY/kOZdxdg77K9R4C9bFvSc4FvAW+k+tWw6xvW/ydg64Efupe0L7CB7V3rxhwR
Ee3VagHGh23PlrQN8N/Adrb/XdKBwKeBnSRtBmwLvM72U5K2AE4EXgt8EFgFWIMq8VwGzC3rPgnYBdivPN8V+NhYglt22aVaaVtXzJw5vdshdFzaPDmkzRNTq0nkB+X/64BFts8rz68F3l0ebwmsA1wlCarfXl6mTHsDcJLtp4GnJX0P2LRMO7kscyDwSmAGY/yNkAcemN9XVTRnzpzOvHmPdjuMjkqbJ4e0uT9MnTplzAffrSaRJ8v/C3j2r2AtaFj3FOBE258dy4pt3ynpJmALYHPgu7b7JyNEREwCnTixfi7wXkkrAkiaJmmDMu0SYGdJz5G0BLDDoGW/C3wA2J5qeCsiInrIuCcR25dRnR85R9INwO+Brcrk44A7gT9QJZSrBy3+I6peyM227xzvWCMiYmymLFo0IUeIZgF35JxI70ubJ4e0uT80nBNZhWcuchp5mfEMKCIiJrYkkYiIqC1JJCIiaksSiYiI2pJEIiKitnFJIpIWSRr2tkdJM8qd6M2u76BSZysiInpIt3oiM4CmkwjwOSBJJCKix7Ra9mREkqYCR1JV6X0KmG/7tcBRwAxJc4DHbW8i6ePAdiWmJ6kq/c6RdFRZ3eWSFgKb2354POOOiIjmjGsSoSq8+AZgDdsLJQ0UXtwHuMb2ug3znmz7cABJbwaOATayvY+kvYFNbM8f53gjImIMxjuJ3E5V4v0ESZcA540w7waSPgW8EFgIrN7qxlMKvj+kzZND2jwxjWsSsf2IpDWp6l+9GfiSpPUHz1dOmp9J9Zsj10laHvhLq9tP2ZPelzZPDmlzf6hTCn5cT6xLmgksaftC4JNUv2r4cuBvwJKSBpLY86gS2l3l+d6DVvUo8ILxjDUiIsZuvK/OWgm4uFTvvRH4KXCl7QeBU4HfSbrc9t+AzwJXS7oWeGzQeg4HLpE0R9KMcY45IiKalCq+PaQfu7+tSpsnh7S5P6SKb0REdFSSSERE1JYkEhERtSWJREREbUkiERFRW88mkdEqAUdERPf1bBKJiIje1+tJ5IByg6El/Vu3g4mIiGfr9SSyoFT6/f/AcZJe3O2AIiLiGeNdxbdVJwDYtqTrgI2Ac5pdOFV8+0PaPDmkzRNTryeRlqTsSe9LmyeHtLk/9FwV3zbYFUDSasB6wJXdDSciIhr1ek/kOZKuB5YEPmj7r90OKCIintGzScT2lPLwoG7GERERw+v14ayIiOhhSSIREVFbkkhERNSWJBIREbUliURERG1JIhERUVtLSUTSQZKeW3PZWZL2GPTa+ZJe0UpMERHROa32RD4HDJlEJI12D8os4FlJxPbbbd/WYkwREdEho95sKGkRcAiwFbAE8CnbZ0k6qsxyuaSFwObAN4C/AwKmA+tKOrU8Xxy4FdjN9kPAUcAqkuYAt9reWtJc4F9t/17SqsCxwMyyzk/ZvqA9zY6IiHZo9o71BbbXlSSqpPEr2/tI2hvYxPZ8gGoy6wKvt/1YWfYjtu8v0w8FPgF8EtgH+KrtDYfZ5qnAcbZPkLQGcJmkV9qe12zjUsW3P6TNk0PaPDE1m0TGUpL9zIYEAvBeSTtSDXs9H7hltI1Jmk6VjL5Ttntz6bFsBJzbZMyp4tsH0ubJIW3uD71SxXf+wANJmwF7AW+zvTbwGeB547DNiIjogmaTyHAl2R8FXjDCcjOAR4AHJC0O7NYw7W/DLWv7UWAOsEvZ7iuBdUgp+IiIntLscNZwJdkPBy6R9ATVifXBLgB2ohrCuh+4DHh1mXYjYEm/B/5oe+tBy+4IHCvpY1Qn1ncey/mQiIgYf1MWLRr5nEG5Omv6wMnzPjELuCPnRHpf2jw5pM39oeGcyCrA3KaWGc+AIiJiYht1OKvhx6EiIiKeJT2RiIioLUkkIiJqSxKJiIjaOp5EWqn8GxERvaUbPZFhK/9GRER/afZmw7YYovLvl4GP8ExS2d/2zyW9GPgtsLXtayTtAuwObG77752MOSIihtfRnojtfcrDTWyvC1wI
bGR7PWA74KQy31+B9wGnSdqIqhT99kkgERG9ZdQ71tut8Q54Sa8GDgVWAJ4G1gJWtH1fmfcgqqKN77LddPVeyh3r7Yw7ImISafqO9Y4OZw3h+8DHbf9Y0lTgcZ5d5Xc9YB6wYp2Vp+xJ70ubJ4e0uT/0Sin40TRW/p3BMz2G3ah+/RCAUnhxMWB94BOS1u1kkBERMbpuJJGByr9zgI8CPy4/dPVy4AGAMsz1YWAX2/dSnVQ/vfxYVURE9IiOnxPpkFmkim9fSJsnh7S5P6SKb0REdFSSSERE1JYkEhERtSWJREREbUkiERFRW98kEUnflbRvt+OIiIhndC2JSOr23fIREdGiTlfxXQQcDLwDuEDSGcDRwPOpyp0cZ/sbZd4VgJOB5aiuV17YyVgjImJ03eiJPGF7tu3/oEoOb7a9PvBqYA9JryzzHQFcZnsNYF/g9V2INSIiRtCNIaWTGh4vCfyXpHWoehrLA+sAfwDeQFX6BNu3S/r5WDc01kJivWDmzMlX2SVtnhzS5ompG0lkfsPjLwD3Ae+z/XdJF/HsKr4tSdmT3pc2Tw5pc3/olyq+jWYAd5UEshawWcO0S4BdASStArypC/FFRMQIun2F1KHAKZLeD9wCXNYw7SPAyZJ2oCoXf2nnw4uIiJF0NInYnjLo+fVUv2Y41Lx/Ib2PiIie1u3hrIiI6GNJIhERUVuSSERE1JYkEhERtSWJREREbUkiERFRW7fvE/mHUpxxOvBrYGPbT3Q5pIiIGEXPJJEBttftdgwREdGcriURSe+mqp31JHBWw+sDPZLHgSOBNwJPAfNtv7YLoUZExDC6kkQkvQQ4HtjEtiUdOMRs61BV8l3D9kJJy4x1O6ni2x/S5skhbZ6YutUTeQ1wnW2X58cBXxo0z+3AYsAJki4BzhvrRlLFt/elzZND2twf+rGK77BsPwKsCZwOvAq4SdJLuxtVREQ06lYSuRJYT9Jq5fkHBs8gaSawpO0LgU8CjwAv71yIERExmq4kEdt/BfYAzpV0PUP/ENVKwMWSbgBuBH5KlXwiIqJHdO3qLNs/An7U8NKh5f+BcvHXARt0NKiIiBiTnj0nEhERvS9JJCIiaksSiYiI2pJEIiKitiSRiIioLUkkIiJqSxKJiIjaOnKfiKRTAQGLA7cCu9l+SNLngfcADwCXAm+yvWFZZhdg7xLjI8BeDbW2IiKiB3SqJ/IR2xvaXhu4CfiEpC2Bf6Wq1rsxMFACBUmbAdsCr7O9AfAV4MQOxRoREU3q1B3r75W0I/Bc4PnALeXxGbYfA5B0EvAfZf4tqZLLVZKguos9peAnqLR5ckibJ6ZxTyKlV7EX1W+HzJO0A1XdrJFMAU60/dlWtp1S8L0vbZ4c0ub+0Kul4GdQndN4QNLiwG7l9UuBrSUtKWkqsHPDMudS9V5WBJA0TVLqaEVE9JhOJJELgNuohrB+SVVYEdvnABdSVei9EriHKtlg+zLg08A5pYrv74GtOhBrRESMwbgPZ9l+muoKrKF83vYnS0/k28AVDcudCpw63vFFRER9XSsFX5wsaRawBHAt8OXuhhMREWPR1SRi+13d3H5ERLQmd6xHRERtSSIREVFbkkhERNSWJBIREbX1RBKR1O2rxCIiooaufXlLWgQcDLwDuEDSbcAOwMPAq4C/AB8CvgqsClwN7GS7f+qYRERMcN3uiTxhe7btgcKLs4H9bP8L8ARwGlViWQNYG3hTd8KMiIihdHsY6aRBz39j++7y+Hpgru2HAUr5k1WBi5tdear49oe0eXJImyembieR+YOeP9nweMEQz8cUb6r49r60eXJIm/tDr1bxjYiICSpJJCIiapuyaFH/DPeMwSzgjgxn9b60eXJIm/tDw3DWKsDcppYZz4AiImJiSxKJiIjakkQiIqK2JJGIiKgtSSQiImpLEomIiNqSRCIiorYkkYiIqK0jtbNK2fdPA+8ClgUOsH1WmfY24IvANGAe8EHbt0raiaoU/KZUdbMu
As60fUwnYo6IiNF15I71kkQ+ZPtISa8FzrC9gqQXAzcBr7d9s6T3A3vYfk1Z7gSq3xd5BFjL9rZNbnIWcEfbGxIRMTk0fcd6J6v4nl7+vxJYXtLzgNcAN9i+uUz7DnC0pOm2HwX2Ba4FFgM2GOsGU/ak96XNk0Pa3B96vYrvkwC2F5TnzSSwlwJLAc8Flh6nuCIioqZun1i/ElhH0r+U57sA19t+VNJzgR8ABwIHAafnt9gjInpLV5OI7XnAzsBpkm4Edir/AL4MzLF9uu3vUJ3jOLQ7kUZExFBSCr6H9OMYaqvS5skhbe4PKQUfEREdlSQSERG1JYlERERtSSIREVFbkkhERNTWkSQiaa6ktTqxrYiI6Jz0RCIiora23wEuaWPgK8D08tIB5f9tJR0PLAd81faRZf6vAq+nKm1yP7Cb7T+X4oynAS8py19s+2PtjjciIupra09E0guBs4EDba8DrA9cXSYvaXtjYHPgMEkDVb4Osz27zP994Evl9R2B22yvbXtt4JB2xhoREa1rd09kY+Bm25fDP4otPiQJShVf23MlPQSsCPwR2ELSPlSFFhvjuRL4mKSvAL8ELhxrMGOtRtkLZs6cPvpME0zaPDmkzRNTJwsaPtnweAHwHEkrA18HZtu+Q9ImVENY2L5C0nrAW6jqa32S6geqmpayJ70vbZ4c0ub+0Aul4K8A1ijnRZA0TdIyI8y/NPC/wH2SpgJ7DkyQtArwN9unA/sBG5R5IiKiR7T1S9n2g8C7ga+VqrzXMsKPSdn+HfBD4GbgKp79a4SbA9dJmgP8FNjT9sJ2xhsREa1JFd8e0o/d31alzZND2twfUsU3IiI6KkkkIiJqSxKJiIjakkQiIqK2JJGIiKitpSQiaY6kJWosl6q+ERETQEt3rNtet12BRERE/2kpiUhaBEy3PV/SXOBkqjIlgyv1bgYcXRb7JTBlqHU0PgcWAicBawJPA7a9bSvxRkREe7X7nMg/VeqVtDhV8cUPlWq8lwEva2JdbwWWtr1GqfD7wTbHGhERLWp3AcahKvU+F3jc9qVl2hmSjmtiXTcAr5R0FHAp8JOxBpMqvv0hbZ4c0uaJqd1J5J8q9Q4z36JB800FkPS8gRdt3y5pTeBNwBbAFyStbftJmpSyJ70vbZ4c0ub+0AtVfIdiYIlyXgRJWwMzGqbfCswuj3cYeFHSisAC2z8GPgbMBF7YgXgjIqJJ455EbD8FbA8cXSr7bg7c2TDLfsCxkq6lShQD1gaukHQD8Fvgi7bvGe94IyKieani20P6sfvbqrR5ckib+0Oq+EZEREcliURERG1JIhERUVuSSERE1JYkEhERtSWJREREbUkiERFRW5JIRETU1u7aWWMm6e3AFxpeWgPYBjgYuArYmKrW1na2/9D5CCMiYjg9dce6pN2BXYFDgHOBV9u+XtKngTVs79jkqmYBd4xPlBERE17Td6x3vScyQNJbqepobQasRfUjVNeXyVcCW451nSl70vvS5skhbe4PvVrFd1SS1gGOAbayfX95udmy8hER0SVdTyKSVgDOAnayfUu344mIiOb1wtH9B6hKwB8laeC1j3UvnIiIaFbXk4jtg6muxBpsw4Z5Lm18HhERvaHrw1kREdG/kkQiIqK2JJGIiKgtSSQiImpLEomIiNq6lkQkbSjp1G5tPyIiWteVS3wlPcf2NUCztbAiIqIHjUsSkbQx8BVgennpAOA44HTgjcDvJJ0CfNX2hpJmAdcAxwNvA5agSjB7Aq8BnqAqiXLfeMQbERH1tH04S9ILgbOBA22vA6wPXF0mL2371bbfP8SiywK/tr0ecALwc+Ao268CrgX2bXesERHRmvHoiWwM3Gz7cgDbC4CHSkmTk0dYbr7tn5TH1wF3255Tnl8LvGWsgYy1GmUvmDlz+ugzTTBp8+SQNk9MnT4nMn+EaU81PF5AG6r4phR870ubJ4e0uT/0Sin4K4A1ynkRJE2TtMw4bCciIrqs7UnE9oPA
u4GvSbqRaihqg3ZvJyIiuq+nfh63jWYBd2Q4q/elzZND2twfGoazmv553NyxHhERtSWJREREbUkiERFRW5JIRETUliQSERG1JYlERERtSSIREVFbkkhERNSWJBIREbV15UepOmAaVHdf9pt+jLlVafPkkDb3voZ4pzW7zEQte7Ip8KtuBxER0ac2A37dzIwTNYksDswG7qUqIx8REaObBixH9UOCT40yLzBxk0hERHRATqxHRERtSSIREVFbkkhERNSWJBIREbUliURERG1JIhERUVuSSERE1DZRy570LElLAt8BNgD+Duxv+7xh5t0d+AQwBfgp8GHbCxumPw+4FnjC9objHXtd7WizpK2Az1LdSDoFONH24Z2Iv1mSVgdOApYFHgDea/tPg+aZBhwBvA1YBBxm+9ujTetVbWjzfwDbUd0U/DTwKdsXdq4FY9NqexvmEXA9cLTt/TsR+3hJT6Tz9gf+ZntVYEvg25KWGjyTpFWAzwEbA6uVfzsNmu3zwJXjG25btKPN9wFb2l4L2ATYS9JmnQh+DI4BjrK9OnAUcOwQ8+wIrErVto2BgyTNamJar2q1zb8FZtt+FbAb8ANJS4x71PW12t6BJHMs8ONxj7YDkkQ67z2UN145grkG2GKI+bYGfmx7Xul9HF+WBaB8ga4GnDLuEbeu5Tbbvsr2PeXxI8AfgJU7EHtTJL0YWB/4fnnp+8D6kmYOmvU9wPG2F9qeR/VFsk0T03pOO9ps+0Lbj5f5bqTqZS477sHX0Ka/McAngfOAW8Y55I5IEum8lwF/bnh+J7DSWOaT9HzgG8Be4xRju7Xc5kaS/gXYCLikjTG2aiXgL7YXAJT/7+Gf4x+pjc3up17RjjY3ei9wm+27xyHWdmi5vZLWAd4KfH3co+2QnBNpM0nXUb2JhvKSNm3mK1Rd6r9IWq1N66ytQ20e2NZywH8Dew/0TKL/SXo98J/AW7ody3iRtBhwHLCr7QXVaZH+lyTSZrbXH2m6pDuphmHmlZdeBvxiiFkH5qNhvrvK402Bt0v6LPA8YBlJN5Zx5Y7rUJsHhhMuBr5s+4etxDwO7gJWkDStfEFMA5anIf5ioI1Xl+eNR60jTetF7WgzkjYGvgdsZdvjH3ZtrbZ3OeAxUTzdAAABNklEQVQVwPklgcwApkha2vYenWjAeMhwVuf9EPggQOlFzAYuGGK+s4B3SpopaSqwO3AGgO1X2Z5lexbVlS2/61YCaVLLbZa0LPAz4EjbJ3Qk6jGw/VdgDrB9eWl74PoyJt7oh8DukqaWsfR3Amc2Ma3ntKPNkmYDPwC2tn1dZyKvp9X22r7T9osaPrvfoDp30rcJBJJEuuErwAxJt1KdXNvD9qMAkg6RtCeA7dupuvdXAn8Cbqc6WutH7WjzJ4HVgQ9KmlP+7drhdoxmT+BDkm4BPlSeI+l8SQOXYJ9C1a4/UbXzENt3NDGtV7Xa5qOBJYBjG/6ua3e0BWPTansnnPyeSERE1JaeSERE1JYkEhERtSWJREREbUkiERFRW5JIRETUliQSERG1JYlERERtSSIREVHb/wHWHBe6YJXsWwAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset rows count AFTER dropping rows with missing values: 506\n"
     ]
    }
    ],
   "source": [
    "# Missing values per column = total rows - non-null count per column\n",
    "total_rows = data.shape[0]\n",
    "total_rows_without_missing_values = data.count()\n",
    "missing_values_count = total_rows - total_rows_without_missing_values\n",
    "plt.barh(data.columns, missing_values_count)\n",
    "plt.title(\"Dataset values count before removing missing values\")\n",
    "plt.show()\n",
    "\n",
    "print(\"Dataset rows count BEFORE dropping rows with missing values:\", data.shape[0])\n",
    "# Drop every row that has at least one missing value\n",
    "if missing_values_count.sum() > 0:\n",
    "    data = data.dropna()\n",
    "\n",
    "print(\"\")\n",
    "\n",
    "# Recompute the per-column missing counts to confirm that none remain\n",
    "total_rows = data.shape[0]\n",
    "total_rows_without_missing_values = data.count()\n",
    "missing_values_count = total_rows - total_rows_without_missing_values\n",
    "plt.barh(data.columns, missing_values_count)\n",
    "plt.title(\"Dataset values count after removing missing values\")\n",
    "plt.show()\n",
    "print(\"Dataset rows count AFTER dropping rows with missing values:\", data.shape[0])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "kP4yEElUGNhi"
   },
   "source": [
    "### Deal with too much data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "4j-0oInYclob"
   },
   "source": [
    "|Type of problem|Technique to use|\n",
    "|---|---|\n",
    "|Needle-in-a-haystack problems: a huge dataset with a disproportionate class distribution, e.g. trying to detect a rare event (a horse rolling) vs. simply standing or lying|Step 1: group the data and plot a histogram to identify the disproportion<br>Step 2: undersample the classes to remove data<br>Step 3: oversample by adding more data|\n",
    "| |Step 1: manage at the training stage (adjust hyperparameters)<br>(check ML Mastery for more techniques in the google docs)|\n",
    "|Class overload problems: a column with an astronomical number of categories, e.g. city in house prices|Group together sparse categories<br>Remove sparse categories<br>Summarise categories into higher levels of abstraction|\n"
   ]
  },
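  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The techniques in the table above can be sketched with plain pandas. This is a hedged illustration, not the method used in the talk: the frames `events` and `houses`, their column names, and the values `cap` and `min_count` are all made up, and random duplication is used here as a simple stand-in for adding more data when oversampling.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical imbalanced data: 'rolling' is the rare event of interest.\n",
    "events = pd.DataFrame({\n",
    "    'activity': ['standing'] * 90 + ['lying'] * 8 + ['rolling'] * 2,\n",
    "    'reading': range(100),\n",
    "})\n",
    "\n",
    "# Step 1: group the data to expose the disproportion\n",
    "# (class_counts.plot.bar() would draw it as a bar chart).\n",
    "class_counts = events['activity'].value_counts()\n",
    "\n",
    "# Step 2: undersample by keeping at most `cap` rows per class.\n",
    "cap = 8\n",
    "undersampled = pd.concat([\n",
    "    group.sample(n=min(len(group), cap), random_state=0)\n",
    "    for _, group in events.groupby('activity')\n",
    "])\n",
    "\n",
    "# Step 3: oversample by drawing rows with replacement up to the majority size.\n",
    "target = class_counts.max()\n",
    "oversampled = pd.concat([\n",
    "    group.sample(n=target, replace=True, random_state=0)\n",
    "    for _, group in events.groupby('activity')\n",
    "])\n",
    "\n",
    "# Class overload: fold categories seen fewer than `min_count` times into 'other'.\n",
    "houses = pd.DataFrame({'city': ['London'] * 5 + ['Leeds'] * 4 + ['Ely', 'Wells', 'Ripon']})\n",
    "min_count = 2\n",
    "city_counts = houses['city'].value_counts()\n",
    "rare_cities = city_counts[city_counts < min_count].index\n",
    "houses['city_grouped'] = houses['city'].where(~houses['city'].isin(rare_cities), 'other')\n",
    "\n",
    "print(undersampled['activity'].value_counts())\n",
    "print(oversampled['activity'].value_counts())\n",
    "print(houses['city_grouped'].value_counts())\n",
    "```\n",
    "\n",
    "In practice, resampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data."
   ]
  },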
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preparatory questions to ask"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do we have any of these problems to fix?\n",
    "\n",
    "- outliers\n",
    "- missing data\n",
    "- class overload\n",
    "- too many features\n",
    "- unbalanced dataset\n",
    "- have we removed or balanced any existing bias in the dataset?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "5ZVFw4kaHVtO"
   },
   "source": [
    "### Please refer to the [Slides](http://bit.ly/do-you-know-your-data) for the steps that follow."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Data Preparation (Do we know our data).ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
